Tonight I attended my first local Cassandra Users Group meetup. The wet walk in the rain sure was worth it!
About half a dozen people attended – most had a Cassandra deployment in production and all had something interesting to say. No fluff here tonight. Everyone had a high level of knowledge of EC – it was good not to be the big fish in the NoSQL pond, and I could learn from folks out in the field.
Sidebar: Cassandra 1.0 was released yesterday! I wonder what the significance of the release number jump from 0.8.7 to 1.0 is – a new item on my TODO queue.
Here follows a synopsis of the production deployments described.

Deployment 1:
- 5 TB of data
- 21 nodes on AWS – spiked to 42 once, now back to 21
- 90% writes – hence Cassandra is ideal
- Two column families:
  - single column – basic key/value
  - wide column – key is the user ID and the columns are his/her emails

Deployment 2:
- 72 nodes
- 2 data centers
- Cassandra 0.7.6

Deployment 3:
- 4 nodes
- RAID 0 disks – redundancy is achieved through Cassandra clustering
- 21 million writes a day
- 250 GB – 7 million XML files ranging from 4 to 450 MB in size
- Column family: key/value
- Replication factor of 3
- 80 concurrent write threads, 160 read threads
- QuickLZ compression down to 20% of original size
- Migrated from MogileFS
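The two column-family shapes mentioned above – a single-column key/value lookup, and wide rows keyed by user ID with one column per email – can be sketched with plain Python structures. This is illustrative only, not a client API, and the names are hypothetical:

```python
from collections import OrderedDict

# Column family 1: single column per row -- a basic key/value lookup.
users = {"alice": {"value": "alice-profile-blob"}}

# Column family 2: wide row -- the row key is the user ID and each
# column *name* is one of that user's emails. Cassandra keeps columns
# sorted by the comparator, which OrderedDict(sorted(...)) mimics here.
def wide_row(*emails):
    return OrderedDict(sorted((e, "") for e in emails))

user_emails = {"alice": wide_row("z@example.com", "alice@example.com")}
```

The payoff of the wide-row shape is that fetching all of a user's emails is a single sequential read of one row rather than many point lookups.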
Regarding adding new nodes to a live cluster, no one had any negative experiences. With RandomPartitioner, several folks said it was best to double the number of nodes, assuming your company can foot the bill. Apparently there was no noticeable rebalancing overhead from shifting data to the new nodes.
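A quick sketch of why doubling is clean: with evenly spaced tokens over RandomPartitioner's 2**127 MD5 token range, every token of the old N-node ring is still a token of the 2N-node ring, so existing nodes keep their positions and each new node simply bisects one old range, streaming data from a single neighbor. (The 21 → 42 figures below echo the deployment mentioned earlier.)

```python
TOKEN_SPACE = 2 ** 127  # RandomPartitioner's MD5 token range

def initial_tokens(n):
    """Evenly spaced tokens for an n-node ring."""
    return [i * TOKEN_SPACE // n for i in range(n)]

old = set(initial_tokens(21))  # the original 21-node ring
new = set(initial_tokens(42))  # after doubling to 42 nodes
assert old < new               # every old token survives; 21 new ones bisect the ranges
```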
Two folks were using manual server token assignments with RandomPartitioner – a third person was not keen on this. Assuming you are using an MD5 hash, you simply divide the 128-bit namespace evenly by the number of servers. This apparently worked better than letting Cassandra compute the tokens, and remained manageable even when new nodes were added and the server tokens had to be recomputed.
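The even division described above is a one-liner. RandomPartitioner takes the non-negative half of the 128-bit MD5 hash, so the conventional token space is [0, 2**127):

```python
# Minimal sketch of manual token assignment for RandomPartitioner.
def initial_tokens(num_nodes, token_space=2 ** 127):
    return [i * token_space // num_nodes for i in range(num_nodes)]

for i, token in enumerate(initial_tokens(4)):
    print(f"node {i}: initial_token = {token}")
# Each value would be set as that node's initial_token in cassandra.yaml.
```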
Hadoop integration was important for many – and there was nothing but praise for DataStax’s Brisk tool that integrates Hadoop and Cassandra. Instead of using the standard HDFS (Hadoop Distributed File System) you can simply point Hadoop to use Cassandra. Two advantages to this are:
- Random access in Cassandra is faster than in HDFS
- Avoid HDFS SPOF (single point of failure)
One company was very happy with Brisk’s seamless Hive integration. Several people expressed issues with the apparent inability to execute MapReduce jobs on more than one column.
Regarding client drivers, the Java users all concurred on the horrors of accessing Cassandra via the complex Thrift-based data model. Hector marginally alleviates this but it is still painful. Compare this to Mongo’s extremely intuitive JSON-based interface! Two companies were using customized Ruby/JRuby gems. I learned that a previously advertised Cassandra option for Avro in addition to Thrift had been dropped.
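To make the ergonomic gap concrete, here is a schematic contrast – neither snippet is a real client API, just the shape of what each data model makes the caller spell out. The Thrift model demands a keyspace, column family, row key, and an explicit (name, value, timestamp) byte triple per column:

```python
import time

def thrift_style_insert(store, keyspace, cf, row_key, name, value):
    # Client-supplied microsecond timestamp, as Thrift-era Cassandra requires.
    ts = int(time.time() * 1_000_000)
    row = store.setdefault((keyspace, cf, row_key), {})
    row[name.encode()] = (value.encode(), ts)

store = {}
thrift_style_insert(store, "Keyspace1", "Users", "alice", "email", "a@example.com")

# A JSON/document-style insert, by contrast, is a single nested value:
doc = {"_id": "alice", "email": "a@example.com"}
```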
There was some discussion of other NoSQL vendors – namely MongoDB and Riak. One company was happily using MongoDB. Others had naturally gravitated to Cassandra because of its core competencies of failover robustness and scalability. The fact that MongoDB used a master/slave paradigm for replication was an issue for some – the master is a SPOF for writes. People also commented that a Cassandra cluster was easier to manage since Cassandra has only one process per node, whereas with sharded MongoDB you need to launch several kinds of processes: mongod, mongos, and config servers.