Hadoop Ecosystem Happenings

These are exciting times for the Hadoop ecosystem. Recently there has been a flurry of activities that can be broadly categorized into the following groups:

  • Hadoop services – Cloudera and Hortonworks
  • Hadoop internals – MapR and Hadapt
  • Big vendor adoption – Oracle, IBM, Microsoft, EMC

Hadoop Services: Sparks Fly

The launch of Hortonworks last summer has certainly given the established Cloudera a run for its money. If I had to summarize the situation in a nutshell, it would be: Cloudera with an established customer base + Doug Cutting vs. Hortonworks with the Yahoo Hadoop development team and lots of promise/potential. Quite an interesting match here – sort of like a Hadoop version of the Thrilla in Manila (RIP Joe Frazier). Hortonworks has finally released its distribution, the Hortonworks Data Platform.

Some good reads:

And then there is the animated discussion on who contributes more to the Hadoop source repo – e.g. number of patches vs. lines of code! Very entertaining stuff.

Hadoop Internals Improvements

Despite its runaway success, Hadoop does have technical shortcomings regarding performance and dependability (e.g. the NameNode as a single point of failure). For example, see the good article on the JobTracker, The Next Generation of Apache Hadoop MapReduce – The Scheduler. These problems have provided an opportunity for several new companies (MapR and Hadapt) to deliver proprietary enhancements that address some of the weaknesses.

The widespread adoption of Hive has significantly expanded Hadoop's end user base, but at the same time it has put a strain on many Hadoop installations. The very expressiveness of Hive empowers business folks to tap into the data store with ease – the barrier of entry for queries is low, and load has increased accordingly. Anecdotally, I have personally witnessed individual queries that resulted in up to a hundred map-reduce jobs! Too often, shops blindly throw more hardware at the problem without first performing a root-cause analysis of the performance issues (see the sketch below). There is definitely a business opening here for more performant solutions. As Hadoop becomes more established in the enterprise, higher quality – faster and more reliable – features are in demand.
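
To make the root-cause point concrete, here is a minimal sketch – the tables are hypothetical, and a Hive server is assumed at localhost:10000 with the Hive JDBC driver of that era – that runs EXPLAIN over JDBC so you can count the map-reduce stages a query compiles into before it hits the cluster:

    // A minimal sketch: run EXPLAIN on a Hive query over JDBC to see how many
    // map-reduce stages it compiles into before unleashing it on the cluster.
    // Assumes a Hive server on localhost:10000 (driver/URL per Hive of this era).
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveExplain {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = conn.createStatement();
            // Hypothetical tables and columns, purely for illustration.
            ResultSet rs = stmt.executeQuery(
                "EXPLAIN SELECT c.region, COUNT(*) " +
                "FROM calls c JOIN customers cu ON (c.cust_id = cu.id) " +
                "GROUP BY c.region ORDER BY c.region");
            while (rs.next()) {
                System.out.println(rs.getString(1)); // the plan lists each MR stage
            }
            conn.close();
        }
    }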

MapR

Earlier this year MapR started shipping two versions of its Hadoop distribution – the free M3 and the more advanced M5, for which it charges. A major emphasis is on fixing Hadoop's single points of failure.

Key features:

  • Distributed, highly available NameNode
  • NFS support – clusters can be mounted as NFS volumes (see the sketch after this list)
  • Heatmap management console
  • Unlimited number of files
  • User and job provisioning, quotas, LDAP integration (now that’s classy!)
  • Point-in-time snapshots
  • 100% compatibility with Hadoop
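
To illustrate the NFS point from the list above: once a cluster is mounted – the /mapr/mycluster path below is a hypothetical mount point – plain java.io calls work against it, with no HDFS client API in sight. A minimal sketch:

    // A minimal sketch of what MapR's NFS support buys you: the cluster looks
    // like an ordinary filesystem, so standard java.io works against it.
    // The mount point /mapr/mycluster is hypothetical.
    import java.io.File;
    import java.io.FileWriter;
    import java.io.PrintWriter;

    public class NfsWrite {
        public static void main(String[] args) throws Exception {
            File out = new File("/mapr/mycluster/user/me/hello.txt");
            PrintWriter pw = new PrintWriter(new FileWriter(out));
            pw.println("written with plain java.io, not the HDFS API");
            pw.close();
            System.out.println("exists: " + out.exists() + ", bytes: " + out.length());
        }
    }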

Hadapt

Hadapt is a new startup based on some very interesting work at Yale in the area of advanced database technology (parallel databases, column stores). Early access to its flagship product, the Hadapt Adaptive Analytic Platform, was just announced the other day. See the good articles at dbms2 on the latest Hadapt news: Hadapt happenings and Hadapt is moving forward.

Key features:

  • Integrates Hadoop with a database storage engine
  • Universal SQL support – instead of the Hive query language, Hadapt provides standard SQL tightly integrated with the underlying database engine (e.g. Postgres).
  • Multi-structured analytics – analysis of both structured and unstructured data
  • Adaptive query execution technology provides on-the-fly load balancing and fault tolerance.
  • An emphasis on load balancing in cloud and virtualized environments – this resonates with me.

This is a good example of the new hybrid solutions emerging in the polyglot persistence space. One should not think of NoSQL and RDBMSs as mutually exclusive propositions. Integrating current analytical tools with new NoSQL data sources is a promising development.

The use of an RDBMS as the underlying data store strikes me as familiar. On one of my recent NoSQL projects for a major telecom, we decided to replace Voldemort’s default Berkeley DB storage engine with MySQL since the latter (purportedly) benchmarked as faster. No SQL or transactions involved – just retrieval speed for a key/value data model.
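
Part of why the engine swap was painless: it is invisible to applications. Here is a minimal sketch of the Voldemort client side, assuming a hypothetical store named "test" bootstrapped at tcp://localhost:6666. Swapping Berkeley DB for MySQL is a server-side store configuration change; this client code is identical either way.

    // A minimal sketch of a Voldemort key/value lookup. Store name and
    // bootstrap URL are hypothetical; the storage engine behind the store
    // (BDB vs. MySQL) is configured server-side and never shows up here.
    import voldemort.client.ClientConfig;
    import voldemort.client.SocketStoreClientFactory;
    import voldemort.client.StoreClient;
    import voldemort.versioning.Versioned;

    public class KvLookup {
        public static void main(String[] args) {
            StoreClient<String, String> client =
                new SocketStoreClientFactory(
                    new ClientConfig().setBootstrapUrls("tcp://localhost:6666"))
                .getStoreClient("test");
            client.put("cust:42", "{\"plan\":\"prepaid\"}");
            Versioned<String> value = client.get("cust:42"); // pure key/value retrieval
            System.out.println(value.getValue());
        }
    }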

Interestingly, even MySQL is now offering a “NoSQL” solution – direct access to the NDB C++ engine via a Memcached interface!
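
Here is what that looks like from the client side – a minimal sketch using the spymemcached Java client, assuming the memcached daemon fronting NDB listens on the default port 11211 (the key and value are hypothetical):

    // A minimal sketch: talk to MySQL Cluster's NDB engine through the
    // standard memcached protocol. Host, port, key, and value are assumptions.
    import java.net.InetSocketAddress;
    import net.spy.memcached.MemcachedClient;

    public class NdbViaMemcached {
        public static void main(String[] args) throws Exception {
            MemcachedClient mc = new MemcachedClient(
                new InetSocketAddress("localhost", 11211));
            mc.set("cust:42", 0, "prepaid").get(); // block until the async set completes
            System.out.println(mc.get("cust:42")); // reads/writes hit NDB underneath
            mc.shutdown();
        }
    }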

For more information on the theoretical and technical underpinnings of Hadapt, see the background papers at Daniel Abadi’s publications page. Also recommended is his thought-provoking – readable but rigorous – dbmusings blog. There are spot-on discussions on eventual consistency and CAP theorem design trade-offs, as well as other great articles.

One interesting TODO project would be to perform a more rigorous and complete comparison of the MapR and Hadapt products. First, you would do a feature-set gap analysis: how does each address the same common problems? Then you would look at the unique value-adds each vendor provides. A more complete analysis would entail running non-trivial performance comparisons, but of course that would require a major investment in hardware and time. You could perhaps start with some kind of TeraSort benchmark comparison of the two – sketched below.
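
To make the benchmark idea concrete, here is a minimal sketch that drives TeraGen/TeraSort/TeraValidate from Java via the stock Hadoop example jobs (the same thing is usually done from the shell with hadoop jar). The row count and HDFS paths are hypothetical; scale them up for anything meaningful.

    // A minimal sketch of a TeraSort benchmark run, assuming the stock Hadoop
    // examples classes are on the classpath. Paths and row count are hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.examples.terasort.TeraGen;
    import org.apache.hadoop.examples.terasort.TeraSort;
    import org.apache.hadoop.examples.terasort.TeraValidate;
    import org.apache.hadoop.util.ToolRunner;

    public class TeraBench {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // 10 million 100-byte rows ~ 1 GB; bump this up for a real test.
            ToolRunner.run(conf, new TeraGen(),
                new String[] {"10000000", "/bench/in"});
            long start = System.currentTimeMillis();
            ToolRunner.run(conf, new TeraSort(),
                new String[] {"/bench/in", "/bench/out"});
            System.out.println("sort took " +
                (System.currentTimeMillis() - start) + " ms");
            // Verify the output really is globally sorted.
            ToolRunner.run(conf, new TeraValidate(),
                new String[] {"/bench/out", "/bench/report"});
        }
    }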

Piccolo – Where Is It?

On a side note of interest, last winter there was a big splash about the Piccolo project, which promised to surpass Hadoop's performance. Rather mysteriously, I haven't seen any activity or news about it since then. Getting an article in the New York Times is quite a significant achievement – I wonder what has happened.
