These are exciting times for the Hadoop ecosystem. Recently there has been a flurry of activity that can be broadly categorized into the following groups:
- Hadoop services – Cloudera and Hortonworks
- Hadoop internals – MapR and Hadapt
- Big vendor adoption – Oracle, IBM, Microsoft, EMC
Hadoop Services Sparks
The launch of Hortonworks last summer has certainly given the established Cloudera a run for its money. In a nutshell: Cloudera, with an established customer base plus Doug Cutting, vs. Hortonworks, with the Yahoo Hadoop development team and lots of promise/potential. Quite an interesting match – sort of like a Hadoop version of the Thrilla in Manila (RIP Joe Frazier). Hortonworks has finally released its distribution, the Hortonworks Data Platform.
Some good reads:
- How Yahoo Spawned Hadoop, the Future of Big Data – Wired – Nov. 2011
- The Hadoop Wars: Cloudera and Hortonworks’ Death Match for Mindshare – wikibon – Nov. 2011
- Yahoo spinoff shakes up Hadoop market with new distro – Gigaom – Nov. 2011
- Cloudera and Hortonworks – dbms2 – July 2011
- The Yahoo! Effect – Hortonworks fires the first round
- The Community Effect – Cloudera responds
- Hortonworks Responds: Counting Hadoop Code and Giving Credit Where Due – ReadWrite – Oct. 2011
Hadoop Internals Improvement
Despite its runaway success, Hadoop does have technical shortcomings regarding performance and dependability (e.g. the NameNode as a single point of failure). For example, see the good article on the JobTracker, Next Generation of Apache Hadoop MapReduce – The Scheduler. These problems have provided an opportunity for several new companies (MapR and Hadapt) to deliver proprietary enhancements that address some of the weaknesses.
The widespread adoption of Hive has significantly expanded Hadoop's end-user base, but at the same time it has put a strain on many Hadoop installations. The very expressiveness of Hive has empowered business folks to tap into the data store rather easily. These days the query barrier to entry is low, and load has increased correspondingly. Anecdotally, I have personally witnessed individual queries that result in up to a hundred map-reduce jobs! Too often shops blindly throw more hardware at the problem without first performing root-cause analysis of performance issues. There is definitely a business opening here for more performant solutions. As Hadoop becomes more established in the enterprise, higher-quality – faster and more reliable – features are in demand.
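To give a feel for the fan-out described above, here is a toy sketch (all table and column names are hypothetical): each JOIN and GROUP BY in a Hive query typically compiles to its own map-reduce stage, so even a modest-looking query spawns several jobs.

```python
import re

# A hypothetical Hive query: two joins plus an aggregation.
query = """
SELECT c.region, SUM(o.amount)
FROM orders o
JOIN customers c ON o.cust_id = c.id
JOIN products p ON o.prod_id = p.id
GROUP BY c.region
"""

def rough_stage_count(hql: str) -> int:
    """Crude lower-bound estimate of MR stages: one per JOIN, one per GROUP BY.
    The real Hive planner is far smarter (it can merge join stages), so this
    is only an illustration of how stages accumulate."""
    upper = hql.upper()
    joins = len(re.findall(r"\bJOIN\b", upper))
    groups = len(re.findall(r"\bGROUP BY\b", upper))
    return joins + groups

print(rough_stage_count(query))  # -> 3
```

Multiply that by dozens of analysts issuing ad hoc queries and it is easy to see how cluster load balloons.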
MapR
Earlier this year MapR started shipping two versions of its Hadoop distribution – the free M3 and more advanced M5 for which it charges. A major emphasis is on fixing Hadoop’s single points of failure.
Key features:
- Distributed, highly available NameNode
- NFS support – can mount clusters as NFS volumes
- Heatmap management console
- Unlimited number of files
- Point-in-time snapshots
- User, job provisioning, quotas, LDAP integration (now that’s classy!)
- 100% compatibility with Hadoop
Hadapt
Hadapt is a new startup based on some very interesting work at Yale in the area of advanced database technology (parallel databases, column stores). Early access to its flagship product, the Hadapt Adaptive Analytic Platform, was just announced the other day. See the good articles at dbms2 on the latest Hadapt news: Hadapt happenings and Hadapt is moving forward.
Key features:
- Integrates Hadoop with a database storage engine
- Universal SQL support – instead of the more general “standard” Hive query language, Hadapt provides a more tightly integrated SQL interface with the database engine (e.g. Postgres).
- Multi-structured analytics – analysis of both structured and unstructured data
- Adaptive query execution technology provides on-the-fly load balancing and fault tolerance.
- An emphasis on cloud and virtualized load balancing – resonates with me.
This is a good example of a new hybrid solution emerging in the polyglot persistence space. One should not think of NoSQL and RDBMS as mutually exclusive propositions. Integrating current analytical tools with new NoSQL data sources is a promising development.
The use of an RDBMS as the underlying data store strikes me as familiar. On one of my recent NoSQL projects for a major telecom, we decided to replace Voldemort's default Berkeley DB storage engine with MySQL since the latter (purportedly) benchmarked as faster. No SQL or transactions involved – just retrieval speed for a key/value data model.
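The pattern is simple enough to sketch in a few lines: a key/value facade over a relational table, where the SQL engine is used purely as a fast lookup structure. This is only an illustrative sketch – it uses SQLite (Python's built-in RDBMS) as a stand-in for MySQL, and the schema is my own invention, not Voldemort's actual MySQL layout.

```python
import sqlite3

class KVStore:
    """A minimal key/value store backed by a relational table.
    Illustrates using an RDBMS as a dumb storage engine: one table,
    primary-key lookups only, no joins or multi-statement transactions."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")

    def put(self, key: str, value: bytes) -> None:
        # Upsert: overwrite any existing value for this key.
        self.db.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))
        self.db.commit()

    def get(self, key: str):
        row = self.db.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return row[0] if row else None

store = KVStore()
store.put("user:42", b"alice")
print(store.get("user:42"))  # -> b'alice'
```

All the engine is asked to do is primary-key retrieval, which is why swapping Berkeley DB for MySQL came down to nothing more than raw lookup benchmarks.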
Interestingly, even MySQL is now offering a “NoSQL” solution – direct access to the NDB C++ engine via a Memcached interface!
- Scalable, persistent, HA NoSQL Memcache storage using MySQL Cluster – clusterdb – Oct. 2011
- NoSQL to MySQL with Memcached – also see blog post
- Percona Server now both SQL and NOSQL – Dec. 2010
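What makes the Memcached route attractive is how thin the wire format is. Below is a sketch of the classic Memcached ASCII protocol that such interfaces speak – no server involved, just building a request and parsing a canned response to show the shape of the exchange (the key name is made up):

```python
def build_get(key: str) -> bytes:
    """Build a Memcached ASCII-protocol 'get' request."""
    return f"get {key}\r\n".encode()

def parse_get_response(raw: bytes) -> dict:
    """Parse a single-value response of the form:
    VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n"""
    header, rest = raw.split(b"\r\n", 1)
    _, key, flags, nbytes = header.split()
    value = rest[: int(nbytes)]
    return {"key": key.decode(), "flags": int(flags), "value": value}

print(build_get("user:42"))  # -> b'get user:42\r\n'

# A response as a server would send it:
canned = b"VALUE user:42 0 5\r\nalice\r\nEND\r\n"
print(parse_get_response(canned))  # value is b'alice'
```

With the NDB plugin, that same tiny protocol fronts a fully persistent, replicated storage engine – which is precisely the "NoSQL access to a SQL database" pitch.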
For more information on the theoretical and technical underpinnings of Hadapt, see the background papers at Daniel Abadi’s publications page. Also recommended is his thought-provoking – readable but rigorous – dbmusings blog. There are spot-on discussions on eventual consistency and CAP theorem design trade-offs as well as other great articles.
One interesting TODO project would be a more rigorous and complete comparison of the MapR and Hadapt products. First, you would do a feature-set gap analysis: how are the same common problems addressed? Then you would look at the unique value-adds each vendor provides. A more complete analysis would entail running non-trivial comparative performance tests, but of course that would require a major investment in hardware and time. You could perhaps start with some kind of TeraSort benchmark comparison of the two.
Piccolo – Dov’è? (Where Is It?)
On a side note, last winter there was a big splash about the Piccolo project, which promised to surpass Hadoop's performance. Rather mysteriously, I haven't seen any activity or news about it since. Getting an article in the New York Times is quite a significant achievement – I wonder what happened.
- Piccolo Project Tries to Speed Past Hadoop – NYT – Feb. 2011
- Piccolo – Building Distributed Programs That Are 11x Faster Than Hadoop – HighScalability – Feb. 2011
- Piccolo Project Tries to Speed Past Hadoop – Gigaom – Feb. 2011