On the Nature of Failure in Distributed Systems

I just finished reading an excellent article by Jay Kreps of Voldemort fame titled Getting Real About Distributed System Reliability. This is a must-read for anyone seriously interested in the operational reliability of NoSQL or Big Data systems. The main points of the article are that node failures are not independent, and that solid operational procedures for your production environment are essential.

The most frequent causes of node failure are not random network blips but software bugs that unfortunately often manifest at the same time on all deployed nodes. Michael Stonebraker makes a similar point in Clarifications on the CAP Theorem and Data-Related Errors, where he describes Bohrbugs: “when multiple data base replicas are available, the same transaction issued to the replicas will cause all of them to crash”. Even though the theoretical probability of node failure increases with cluster size, this is not the main problem for distributed system reliability.
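To make the Bohrbug point concrete, here is a toy sketch (the transaction format and the bug are invented for illustration, not taken from Stonebraker's paper) of why replication offers no protection against a deterministic software bug: every replica runs the same faulty code on the same input and fails identically.

```python
def apply_transaction(state, txn):
    # Deterministic bug: divides by a field that can legally be zero.
    state['balance'] += txn['amount'] / txn['split']
    return state

# Three replicas, all running identical code on identical state.
replicas = [{'balance': 100} for _ in range(3)]
poison_txn = {'amount': 50, 'split': 0}

crashed = 0
for state in replicas:
    try:
        apply_transaction(state, poison_txn)
    except ZeroDivisionError:
        crashed += 1  # every replica fails the same way

# crashed == 3: redundancy did nothing, because the failures were
# perfectly correlated, not independent.
```

This is exactly why the independent-failure assumption behind naive reliability math breaks down in practice.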

Kreps points out that the theoretical conclusions regarding reliability provided by the CAP theorem do not accurately translate into run-time reliability. Stonebraker makes a similar point: “the model does not deal with several important classes of errors, which a real world system administrator must cope with”.

This is not so much a problem with the CAP theorem per se. Satisfying the CAP theorem is a necessary but not sufficient condition; it is more about system design than implementation. On a side note regarding the CAP theorem, there is an excellent discussion by Daniel Abadi in Problems with CAP, and Yahoo’s little known NoSQL system of some of the problems of differentiating CAP’s Availability and Partition tolerance, i.e., what exactly is the distinction between a CA and a CP system?

Kreps underscores the importance of operational procedures since no code is bug-free. This is especially true of relatively immature NoSQL systems, which are young and have not gone through the long hardening phase that traditional databases have. In any complex long-running system you are bound to have outages no matter how good your code is, and the real question to ask is: how do you recover from failures? Of course this is not a novel idea, but it is worth keeping the focus on this point, especially when it comes from a reliable NoSQL authority.

The importance of system reliability and recovery is underscored by the rather embarrassing recent Azure outage due to February’s leap day. Because of a minor software bug, Azure was unable to launch new elastic VMs on February 29, 2012. This didn’t happen on just one machine but across the whole cluster. In other words, the single point of failure was a software bug! All the redundant hardware and power sources weren’t much good in this situation. See a good recap: The Azure Outage: Time Is A SPOF, Leap Day Doubly So. The two main points to emphasize here: the failure was not random but widespread, and operational recovery procedures are a necessity.
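The commonly cited root cause was date arithmetic that produced an invalid February 29 one year out. The actual Azure code is not public, so this is only a minimal sketch of that failure mode: a naive "add one year" computation blows up on leap day, and one common guard avoids it.

```python
from datetime import date

def naive_one_year_later(d):
    # Naive year increment: raises ValueError when d is Feb 29
    # and the following year is not a leap year.
    return d.replace(year=d.year + 1)

def safe_one_year_later(d):
    # One common guard: clamp Feb 29 to Feb 28 in non-leap years.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)

leap_day = date(2012, 2, 29)
# naive_one_year_later(leap_day) raises ValueError -- the leap-day failure mode
valid_to = safe_one_year_later(leap_day)  # date(2013, 2, 28)
```

Because every node ran the same date logic on the same day, the bug fired everywhere at once, which is precisely the correlated-failure pattern discussed above.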

Finally, Kreps raises the need for empirically verifiable tests of NoSQL products. Having spent an inordinate amount of time evaluating NoSQL systems (think cross-country drive rather than a mere drive around the block), this point strongly resonates with me. Distributed systems are notoriously complex beasts since they have to deal with failure in a fundamentally different way than single-server systems. NoSQL vendors would be doing the community and themselves a great service if they made their high-volume tests and benchmarks public. I would be delighted to see these test results posted regularly on public CI build dashboards. For NoSQL systems to gain wider acceptance in the enterprise community, more transparency is needed regarding scalability. Concrete, comparable benchmarks are needed.

A step in this direction would be to develop a standardized NoSQL equivalent of the TPC benchmarks. For example, to start off, implement a series of high-volume and high-throughput tests with a key/value data model. Kick off a few million writes, then reads, some read/writes and deletes, with different record and key sizes, and see what happens. Set up a cluster, then start killing and restarting nodes, and see if the system behaves correctly and performs well. I’ve put a lot of effort into such a framework called vtest, and I know it’s not easy. By publicly exposing these kinds of tests, vendors would only increase confidence in their products.
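As a rough illustration of the kind of harness just described (vtest itself is not shown; the store interface, function names, and parameters are assumptions, with a dict-like client standing in for a real key/value store), a basic write-then-read throughput pass might look like:

```python
import random
import string
import time

def random_bytes(n):
    # Generate a random payload of n ASCII letters.
    return ''.join(random.choices(string.ascii_letters, k=n)).encode()

def run_kv_benchmark(store, num_ops=10000, key_size=16, value_size=128):
    """Run a write phase then a read phase against a dict-like key/value
    store and return (write_ops_per_sec, read_ops_per_sec)."""
    keys = [random_bytes(key_size) for _ in range(num_ops)]
    value = random_bytes(value_size)

    start = time.perf_counter()
    for k in keys:
        store[k] = value                 # write phase
    write_ops = num_ops / (time.perf_counter() - start)

    start = time.perf_counter()
    for k in keys:
        assert store[k] == value         # read phase, verifying correctness
    read_ops = num_ops / (time.perf_counter() - start)

    return write_ops, read_ops

# Usage: an in-memory dict stands in for a real NoSQL client here;
# swapping in an actual store client would exercise the network path.
writes_per_sec, reads_per_sec = run_kv_benchmark({}, num_ops=1000)
```

A real harness would of course add the variations mentioned above: mixed read/write workloads, deletes, varied key and record sizes, and node kill/restart during the run.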
