Tonight I attended my first local Cassandra Users Group meetup. The wet walk in the rain sure was worth it!
About half a dozen people attended – most had a Cassandra deployment in production and all had something interesting to say. No fluff here tonight. Everyone had a high level of knowledge of EC – it was good not to be the big fish in the NoSQL pond, and I could learn from folks out in the field.
Sidebar: Cassandra 1.0 was released yesterday! I wonder what the significance of the release number jump from 0.8.7 to 1.0 is – a new item on my TODO queue.
Here follows a synopsis of the production deployments described.

Deployment 1:
- 5 TB of data
- 21 nodes on AWS – spiked to 42 once, now back to 21
- 90% writes – hence Cassandra is ideal
- Two column families:
  - single column – basic key/value
  - wide column – key is the user ID and the columns are his/her emails

Deployment 2:
- 72 nodes
- 2 data centers
- Cassandra 0.7.6

Deployment 3:
- 4 nodes
- RAID 0 disks – redundancy is achieved through Cassandra clustering
- 21 million writes a day
- 250 GB – 7 million XML files ranging from 4 to 450 MB in size
- Column family: key/value
- Replication factor of 3
- 80 concurrent write threads, 160 read threads
- QuickLZ compression down to 20% of original size
- Migrated from MogileFS
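The two column-family shapes mentioned above – a single-column key/value lookup, and wide rows keyed by user ID with one column per email – can be sketched with plain Python structures. This is illustrative only, not a client API, and the names are hypothetical:

```python
from collections import OrderedDict

# Column family 1: single column per row -- a basic key/value lookup.
users = {"alice": {"value": "alice-profile-blob"}}

# Column family 2: wide row -- the row key is the user ID and each
# column *name* is one of that user's emails. Cassandra keeps columns
# sorted by the comparator, which OrderedDict(sorted(...)) mimics here.
def wide_row(*emails):
    return OrderedDict(sorted((e, "") for e in emails))

user_emails = {"alice": wide_row("z@example.com", "alice@example.com")}
```

The payoff of the wide-row shape is that fetching all of a user's emails is a single sequential read of one row rather than many point lookups.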
Regarding adding new nodes to a live cluster, no one had any negative experiences. With RandomPartitioner, several folks said it was best to double the number of nodes, assuming your company can foot the bill. Apparently there was no noticeable rebalancing overhead from shifting data to the new nodes.
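A quick sketch of why doubling is clean: with evenly spaced tokens over RandomPartitioner's 2**127 MD5 token range, every token of the old N-node ring is still a token of the 2N-node ring, so existing nodes keep their positions and each new node simply bisects one old range, streaming data from a single neighbor. (The 21 → 42 figures below echo the deployment mentioned earlier.)

```python
TOKEN_SPACE = 2 ** 127  # RandomPartitioner's MD5 token range

def initial_tokens(n):
    """Evenly spaced tokens for an n-node ring."""
    return [i * TOKEN_SPACE // n for i in range(n)]

old = set(initial_tokens(21))  # the original 21-node ring
new = set(initial_tokens(42))  # after doubling to 42 nodes
assert old < new               # every old token survives; 21 new ones bisect the ranges
```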
Two folks were using manual server token assignments with RandomPartitioner – a third person was not keen on this. Assuming you are using an MD5 hash, you simply divide the 128-bit namespace evenly by the number of servers. This apparently worked better than letting Cassandra compute the tokens, and remained manageable even when new nodes were added and the server tokens had to be recomputed.
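The even division described above is a one-liner. RandomPartitioner takes the non-negative half of the 128-bit MD5 hash, so the conventional token space is [0, 2**127):

```python
# Minimal sketch of manual token assignment for RandomPartitioner.
def initial_tokens(num_nodes, token_space=2 ** 127):
    return [i * token_space // num_nodes for i in range(num_nodes)]

for i, token in enumerate(initial_tokens(4)):
    print(f"node {i}: initial_token = {token}")
# Each value would be set as that node's initial_token in cassandra.yaml.
```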
Hadoop integration was important for many – and there was nothing but praise for DataStax’s Brisk tool that integrates Hadoop and Cassandra. Instead of using the standard HDFS (Hadoop Distributed File System) you can simply point Hadoop to use Cassandra. Two advantages to this are:
- Random access in Cassandra is faster than in HDFS
- Avoid HDFS SPOF (single point of failure)
One company was very happy with Brisk’s seamless Hive integration. Several people expressed issues with the apparent inability to execute MapReduce jobs on more than one column.
Regarding client drivers, the Java users all concurred on the horrors of accessing Cassandra via the complex Thrift-based data model. Hector marginally alleviates this but it is still painful. Compare this to Mongo’s extremely intuitive JSON-based interface! Two companies were using customized Ruby/JRuby gems. I learned that a previously advertised Cassandra option for Avro in addition to Thrift had been dropped.
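To make the ergonomic gap concrete, here is a schematic contrast – neither snippet is a real client API, just the shape of what each data model makes the caller spell out. The Thrift model demands a keyspace, column family, row key, and an explicit (name, value, timestamp) byte triple per column:

```python
import time

def thrift_style_insert(store, keyspace, cf, row_key, name, value):
    # Client-supplied microsecond timestamp, as Thrift-era Cassandra requires.
    ts = int(time.time() * 1_000_000)
    row = store.setdefault((keyspace, cf, row_key), {})
    row[name.encode()] = (value.encode(), ts)

store = {}
thrift_style_insert(store, "Keyspace1", "Users", "alice", "email", "a@example.com")

# A JSON/document-style insert, by contrast, is a single nested value:
doc = {"_id": "alice", "email": "a@example.com"}
```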
There was some discussion of other NoSQL vendors – namely MongoDB and Riak. One company was happily using MongoDB. Others had naturally gravitated to Cassandra because of its core competencies of failover robustness and scalability. The fact that MongoDB used a master/slave paradigm for replication was an issue for some – the master is a SPOF for writes. People also commented that a Cassandra cluster was easier to manage since Cassandra has only one process per node, whereas with sharded MongoDB you need to launch several kinds of processes: mongod, mongos, and config servers.