Eventual Consistency Testing

I’ve recently been involved in testing a massively scalable application based on an eventual consistency framework called Voldemort.

The key articles on Dynamo and eventual consistency are Amazon’s Dynamo paper (DeCandia et al., SOSP 2007) and Werner Vogels’ essay “Eventually Consistent”.

Dynamo has inspired a variety of frameworks based on distributed hash table principles, such as Cassandra, Voldemort, and Mongo. What all these tools strive to address is the inherent limit to massive scalability in traditional relational databases, hence the name “NoSQL”.

How is Dynamo tested?

All this sounds fine, but the real question is: does it work in real life? Unfortunately Amazon has not released the Dynamo source code, and beyond the fact that it is written in Java, little is known. As a pragmatic sort of fellow, I am always keen on knowing the nuts and bolts of new-fangled solutions. How does Amazon certify builds? What is Amazon’s test framework for a massively scalable system such as Dynamo? What sort of tests do they have? How do they specify and test their SLAs? How do they test the intricate and complex logic of quorum-based reads and writes as cluster nodes are brought up and down? I could well imagine the complexity of such a test environment exceeding the complexity of the application itself.

Embedded and Standalone Execution Contexts

One of the nice things about the Voldemort project is its strong emphasis on modularity and mockable objects. The Voldemort server can be launched in embedded mode, which greatly facilitates many testing scenarios. However, this in no way replaces the need to test against an actual standalone server. Embedded vs. standalone testing is a false dilemma: the vast majority of test cases can and should be run in both modes, embedded for ease of use, standalone for truer validation since it more closely approximates the target production environment. So the first step was to create an “Execution Context” object that encapsulates the different bootstrapping logic.

The InitContext interface.

public interface InitContext {
  public void start() throws IOException ;
  public void stop() throws IOException ;
  public VoldemortStoreDao getTestStoreDao() ;
}

EmptyInitContext for a standalone server. Nothing much needs to be done, since the server is launched externally.

public class EmptyInitContext implements InitContext
{
  private VoldemortStoreDao storeDao ;

  public EmptyInitContext(VoldemortStoreDao storeDao) {
    this.storeDao = storeDao ;
  }

  public void start() throws IOException {
  }

  public void stop() throws IOException {
  }

  public VoldemortStoreDao getTestStoreDao() {
    return storeDao ;
  }
}

EmbeddedServerInitContext for an embedded Voldemort server that uses an embedded Berkeley DB store.

public class EmbeddedServerInitContext implements InitContext
{
  private VoldemortServer server ;
  private TestConfig testConfig ;
  private VoldemortStoreDao testDao ;
  private boolean useNio = false ;

  public EmbeddedServerInitContext(TestConfig testConfig) {
    this.testConfig = testConfig ;
  }

  public void start() throws IOException {
    String configDir = testConfig.getVoldemortConfigDir();
    String dataDir = testConfig.getVoldemortDataDir();
    int nodeId = 0 ;
    server = new VoldemortServer(
      ServerTestUtils.createServerConfig(useNio, nodeId, dataDir,
        configDir + "/cluster.xml", configDir + "/stores.xml",
        new Properties() ));
    server.start();

    StoreRepository srep = server.getStoreRepository();
    // Wildcard type: the raw List in the original would not compile
    // with the enhanced for loop below
    List<? extends Store> stores = srep.getAllLocalStores();
    for (Store store : stores) {
      Store lstore = VoldemortTestUtils.getLeafStore(store);
      if (lstore instanceof StorageEngine) {
        if (store.getName().equals(testConfig.getStoreName())) {
          StorageEngine engine = (StorageEngine) lstore;
          testDao = new StorageEngineDaoImpl(engine);
          break;
        }
      }
    }
  }

  public void stop() throws IOException {
    ServerTestUtils.stopVoldemortServer(server);
  }

  public VoldemortStoreDao getTestStoreDao() {
    return testDao ;
  }
}

Server Cycling

One important place where embedded and standalone testing logic do differ is server cycling, which is especially important when testing eventual consistency scenarios. Server cycling refers to the starting and stopping of server nodes. In embedded mode this is no problem, since everything executes inside one JVM process. When the servers are separate processes, the problem becomes significantly more difficult. Stopping a remote Voldemort server actually turns out to be easy, since Voldemort exposes a JMX MBean with a stop operation. Needless to say, this technique cannot be used to start a server! In order to launch a server, the test client has to somehow invoke a script on a remote machine. The following steps need to be done:

  • Use Java Runtime.exec to invoke a script on the remote machine via ssh
  • The script must first check that a server is not already running; if it is, an error is returned
  • The script calls voldemort-server.sh
  • The script waits an indeterminate amount of time to allow the server to start
  • The script invokes “some operation” to ascertain that the server is ready to accept requests
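The launch step above can be sketched with ProcessBuilder. The user, host, and script path are hypothetical placeholders, and the remote script itself is responsible for the not-already-running check:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Sketch of launching a remote Voldemort server over ssh.
// User, host, and script path are hypothetical placeholders.
public class RemoteServerLauncher {

    // Build the ssh command; separated out so it can be inspected in tests.
    static List<String> buildCommand(String user, String host, String script) {
        return Arrays.asList("ssh", user + "@" + host, script);
    }

    // Run the command and block until the remote script returns.
    static int launch(String user, String host, String script)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(user, host, script))
                .inheritIO()
                .start();
        return p.waitFor(); // a non-zero exit code means the script reported an error
    }
}
```

Keeping command construction separate from execution also makes the ssh invocation testable without a real remote machine.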

As you can see, each step is fraught with problems. In local embedded mode this series of complex steps is subsumed in a blocking call that simply creates a new in-process object. In standalone mode, the wait step is problematic since there is no precise amount of time to wait. Wait, and then do what to determine server liveness? Invoke an operation? That could affect the integrity of the very operation we are testing! One potential solution is to invoke a JMX operation that serves as a liveness check. Assuming all goes well, all of this takes time, and for a large battery of tests the overall execution time increases significantly.
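One way to bound the indeterminate wait is a polling loop with a timeout around whatever liveness probe is chosen (a JMX attribute read, a TCP connect to the server port, etc.). A minimal sketch, with the probe left pluggable since the exact check is an open question:

```java
import java.util.function.BooleanSupplier;

// Poll a pluggable liveness probe until it succeeds or a timeout expires.
// The probe itself (JMX read, TCP connect, ...) is deliberately abstracted.
public class LivenessWaiter {

    static boolean waitForLive(BooleanSupplier probe,
                               long timeoutMillis,
                               long pollMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (probe.getAsBoolean()) {
                return true; // server answered the probe
            }
            Thread.sleep(pollMillis);
        }
        return false; // timed out; the caller should fail the test run
    }
}
```

The timeout turns “wait an indeterminate amount of time” into a bounded, explicit failure mode.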

Eventual Consistency Test Example

Let us look at some examples. Assume we have a three-node cluster with N=3, W=2, R=2. N is the number of replicas for each key, W is the number of nodes that must acknowledge a write for it to succeed, and R is the number of nodes that must respond to a read. For example, for a put the system will try to write the data to all three replicas; if two (or three) writes succeed, the operation is considered successful.
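The quorum arithmetic is simple enough to state directly. A tiny helper (the names are mine, not part of the Voldemort API) that decides whether an operation reached quorum:

```java
// Minimal illustration of N/W/R quorum semantics.
// Helper names are hypothetical, not part of the Voldemort API.
public class QuorumCheck {

    // An operation succeeds when the number of replicas that acknowledged
    // it meets the required count (W for writes, R for reads).
    public static boolean reachesQuorum(int acks, int required) {
        return acks >= required;
    }

    public static void main(String[] args) {
        int n = 3, w = 2;
        // One node down: 2 of 3 replicas ack the write, so it succeeds.
        System.out.println(reachesQuorum(n - 1, w));
        // Two nodes down: only 1 ack, so the write fails.
        System.out.println(reachesQuorum(n - 2, w));
    }
}
```

This is exactly why the “one node down” scenario below is expected to pass with N=3, W=2, R=2, while a two-node outage would not.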

Get – One Node Down

  • Call put(K,V)
  • For each node in cluster
    • Stop node
    • Call V2=get(K)
    • Assert that V==V2
    • Start node
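The loop above, generalized over operations, might look like the following sketch. ServerCycler is my own hypothetical abstraction: in embedded mode start/stop would delegate to the in-process server, in standalone mode to the ssh/JMX machinery described earlier.

```java
import java.util.List;

// Hypothetical abstraction over embedded vs. standalone server cycling.
interface ServerCycler {
    List<Integer> nodeIds();
    void stop(int nodeId);
    void start(int nodeId);
}

// Sketch of the "one node down" test loop, parameterized over the
// operation under test via put/get functions.
public class OneNodeDownTest {

    static <K, V> void assertGetSurvivesOneNodeDown(
            ServerCycler cluster,
            java.util.function.BiConsumer<K, V> put,
            java.util.function.Function<K, V> get,
            K key, V value) {
        put.accept(key, value);
        for (int node : cluster.nodeIds()) {
            cluster.stop(node);
            V read = get.apply(key);
            if (!value.equals(read)) {
                throw new AssertionError("get(" + key + ") with node "
                        + node + " down returned " + read);
            }
            cluster.start(node);
        }
    }
}
```

Parameterizing over put/get keeps the cycling logic in one place while each of the thirteen operations supplies its own closures.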

This logic needs to be executed against all thirteen Voldemort operations: get, put, putIfNotObsolete, delete, etc. Whew! Now imagine if the requirements call for testing against two different cluster configurations, say a 3/2/2 and a 5/3/3!
