Hadoop Ecosystem Happenings

These are exciting times for the Hadoop ecosystem. Recently there has been a flurry of activities that can be broadly categorized into the following groups:

  • Hadoop services – Cloudera and Hortonworks
  • Hadoop internals – MapR and Hadapt
  • Big vendor adoption – Oracle, IBM, Microsoft, EMC

Hadoop Services: Sparks Fly

The launch of Hortonworks last summer has certainly given the established Cloudera a run for its money. If I had to summarize the situation in a nutshell, it would be: Cloudera with an established customer base + Doug Cutting vs. Hortonworks with the Yahoo Hadoop development team and lots of promise/potential. Quite an interesting match here – sort of like a Hadoop version of the Thrilla in Manila (RIP Joe Frazier). Hortonworks has finally released its distribution, the Hortonworks Data Platform.

Some good reads:

And then there is the animated discussion on who contributes more to the Hadoop source repo – e.g. number of patches vs. lines of code! Very entertaining stuff.

Hadoop Internals Improvements

Despite its runaway success, Hadoop does have technical shortcomings regarding performance and dependability (e.g. the NameNode as a single point of failure). For example, see the good article on the JobTracker, The Next Generation of Apache Hadoop MapReduce – The Scheduler. These problems have provided an opportunity for several new companies (MapR and Hadapt) to deliver proprietary enhancements that address some of the weaknesses.

The widespread adoption of Hive has significantly expanded Hadoop's end user base, but at the same time it has put a strain on many Hadoop installations. The very expressiveness of Hive empowers business folks to tap into the data store with ease – the barrier of entry for queries is low, and load has increased accordingly. Anecdotally, I have personally witnessed individual queries that resulted in up to a hundred map-reduce jobs! Too often, shops blindly throw more hardware at the problem without first performing a root-cause analysis of the performance issues (see the sketch below). There is definitely a business opening here for more performant solutions. As Hadoop becomes more established in the enterprise, higher quality – faster and more reliable – features are in demand.
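
To make the root-cause point concrete, here is a minimal sketch – the tables are hypothetical, and a Hive server is assumed at localhost:10000 with the Hive JDBC driver of that era – that runs EXPLAIN over JDBC so you can count the map-reduce stages a query compiles into before it hits the cluster:

    // A minimal sketch: run EXPLAIN on a Hive query over JDBC to see how many
    // map-reduce stages it compiles into before unleashing it on the cluster.
    // Assumes a Hive server on localhost:10000 (driver/URL per Hive of this era).
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveExplain {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = conn.createStatement();
            // Hypothetical tables and columns, purely for illustration.
            ResultSet rs = stmt.executeQuery(
                "EXPLAIN SELECT c.region, COUNT(*) " +
                "FROM calls c JOIN customers cu ON (c.cust_id = cu.id) " +
                "GROUP BY c.region ORDER BY c.region");
            while (rs.next()) {
                System.out.println(rs.getString(1)); // the plan lists each MR stage
            }
            conn.close();
        }
    }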

MapR

Earlier this year MapR started shipping two versions of its Hadoop distribution – the free M3 and the more advanced M5, for which it charges. A major emphasis is on fixing Hadoop's single points of failure.

Key features:

  • Distributed, highly available NameNode
  • NFS support – clusters can be mounted as NFS volumes (see the sketch after this list)
  • Heatmap management console
  • Unlimited number of files
  • User and job provisioning, quotas, LDAP integration (now that’s classy!)
  • Point-in-time snapshots
  • 100% compatibility with Hadoop
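
To illustrate the NFS point from the list above: once a cluster is mounted – the /mapr/mycluster path below is a hypothetical mount point – plain java.io calls work against it, with no HDFS client API in sight. A minimal sketch:

    // A minimal sketch of what MapR's NFS support buys you: the cluster looks
    // like an ordinary filesystem, so standard java.io works against it.
    // The mount point /mapr/mycluster is hypothetical.
    import java.io.File;
    import java.io.FileWriter;
    import java.io.PrintWriter;

    public class NfsWrite {
        public static void main(String[] args) throws Exception {
            File out = new File("/mapr/mycluster/user/me/hello.txt");
            PrintWriter pw = new PrintWriter(new FileWriter(out));
            pw.println("written with plain java.io, not the HDFS API");
            pw.close();
            System.out.println("exists: " + out.exists() + ", bytes: " + out.length());
        }
    }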

Hadapt

Hadapt is a new startup based on some very interesting work at Yale in the area of advanced database technology (parallel databases, column stores). Early access to its flagship product, the Hadapt Adaptive Analytic Platform, was just announced the other day. See the good articles at dbms2 on the latest Hadapt news: Hadapt happenings and Hadapt is moving forward.

Key features:

  • Integrates Hadoop with a database storage engine
  • Universal SQL support – instead of the Hive query language, Hadapt provides standard SQL tightly integrated with the underlying database engine (e.g. Postgres).
  • Multi-structured analytics – analysis of both structured and unstructured data
  • Adaptive query execution technology provides on-the-fly load balancing and fault tolerance.
  • An emphasis on load balancing in cloud and virtualized environments – this resonates with me.

This is a good example of the new hybrid solutions emerging in the polyglot persistence space. One should not think of NoSQL and RDBMSs as mutually exclusive propositions. Integrating current analytical tools with new NoSQL data sources is a promising development.

The use of an RDBMS as the underlying data store strikes me as familiar. On one of my recent NoSQL projects for a major telecom, we decided to replace Voldemort’s default Berkeley DB storage engine with MySQL since the latter (purportedly) benchmarked as faster. No SQL or transactions involved – just retrieval speed for a key/value data model.
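
Part of why the engine swap was painless: it is invisible to applications. Here is a minimal sketch of the Voldemort client side, assuming a hypothetical store named "test" bootstrapped at tcp://localhost:6666. Swapping Berkeley DB for MySQL is a server-side store configuration change; this client code is identical either way.

    // A minimal sketch of a Voldemort key/value lookup. Store name and
    // bootstrap URL are hypothetical; the storage engine behind the store
    // (BDB vs. MySQL) is configured server-side and never shows up here.
    import voldemort.client.ClientConfig;
    import voldemort.client.SocketStoreClientFactory;
    import voldemort.client.StoreClient;
    import voldemort.versioning.Versioned;

    public class KvLookup {
        public static void main(String[] args) {
            StoreClient<String, String> client =
                new SocketStoreClientFactory(
                    new ClientConfig().setBootstrapUrls("tcp://localhost:6666"))
                .getStoreClient("test");
            client.put("cust:42", "{\"plan\":\"prepaid\"}");
            Versioned<String> value = client.get("cust:42"); // pure key/value retrieval
            System.out.println(value.getValue());
        }
    }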

Interestingly, even MySQL is now offering a “NoSQL” solution – direct access to the NDB C++ engine via a Memcached interface!
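
Here is what that looks like from the client side – a minimal sketch using the spymemcached Java client, assuming the memcached daemon fronting NDB listens on the default port 11211 (the key and value are hypothetical):

    // A minimal sketch: talk to MySQL Cluster's NDB engine through the
    // standard memcached protocol. Host, port, key, and value are assumptions.
    import java.net.InetSocketAddress;
    import net.spy.memcached.MemcachedClient;

    public class NdbViaMemcached {
        public static void main(String[] args) throws Exception {
            MemcachedClient mc = new MemcachedClient(
                new InetSocketAddress("localhost", 11211));
            mc.set("cust:42", 0, "prepaid").get(); // block until the async set completes
            System.out.println(mc.get("cust:42")); // reads/writes hit NDB underneath
            mc.shutdown();
        }
    }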

For more information on the theoretical and technical underpinnings of Hadapt, see the background papers at Daniel Abadi’s publications page. Also recommended is his thought-provoking – readable but rigorous – dbmusings blog. There are spot-on discussions on eventual consistency and CAP theorem design trade-offs, as well as other great articles.

One interesting TODO project would be to perform a more rigorous and complete comparison of the MapR and Hadapt products. First, you would do a feature-set gap analysis: how does each address the same common problems? Then you would look at the unique value-adds each vendor provides. A more complete analysis would entail running non-trivial performance comparisons, but of course that would require a major investment in hardware and time. You could perhaps start with some kind of TeraSort benchmark comparison of the two – sketched below.
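
To make the benchmark idea concrete, here is a minimal sketch that drives TeraGen/TeraSort/TeraValidate from Java via the stock Hadoop example jobs (the same thing is usually done from the shell with hadoop jar). The row count and HDFS paths are hypothetical; scale them up for anything meaningful.

    // A minimal sketch of a TeraSort benchmark run, assuming the stock Hadoop
    // examples classes are on the classpath. Paths and row count are hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.examples.terasort.TeraGen;
    import org.apache.hadoop.examples.terasort.TeraSort;
    import org.apache.hadoop.examples.terasort.TeraValidate;
    import org.apache.hadoop.util.ToolRunner;

    public class TeraBench {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // 10 million 100-byte rows ~ 1 GB; bump this up for a real test.
            ToolRunner.run(conf, new TeraGen(),
                new String[] {"10000000", "/bench/in"});
            long start = System.currentTimeMillis();
            ToolRunner.run(conf, new TeraSort(),
                new String[] {"/bench/in", "/bench/out"});
            System.out.println("sort took " +
                (System.currentTimeMillis() - start) + " ms");
            // Verify the output really is globally sorted.
            ToolRunner.run(conf, new TeraValidate(),
                new String[] {"/bench/out", "/bench/report"});
        }
    }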

Piccolo – Where Is It?

On a side note of interest, last winter there was a big splash about the Piccolo project, which promised to surpass Hadoop's performance. Rather mysteriously, I haven't seen any activity or news about it since then. Getting an article in the New York Times is quite a significant achievement – I wonder what has happened.
