Archive for the ‘Hadoop’ Category

MapReduce Design Patterns Book Review

July 28, 2014

I recently came across the delightful book MapReduce Design Patterns by Donald Miner and Adam Shook. The book is an indispensable addition to the collection of any self-respecting big data professional. It is on par with another favorite of mine RESTful Web Services Cookbook. Both books are perfect examples of the right mix of theory and practice. They deal with difficult big data problems that developers and architects encounter, and provide focused, pragmatic and conceptually sound solutions. Kudos to O’Reilly for providing us with such pearls of wisdom.

Several projects that I’ve been involved with could have sorely used this book. Hungry developers eager to adopt hot new Hadoop technologies initially dazzle business folks with their preliminary efforts. The initial results of monetizing large amounts of passive data makes everyone happy.

Here’s one case I witnessed. In first stage, ad server logs from a dozen machines are pushed to a HDFS cluster, MapReduce boils them down to CSV import files, and these are then loaded into a MySQL “data warehouse”.  As ad requests rapidly increase the number of ad servers machines grows to 20 or 30 and the initial architecture starts buckling as the sheer complexity of many moving parts overwhelms the team. Without a proper grounding in big data concepts and frameworks, things fall apart in the face of higher volume and velocity. This is a good example of where quantitative changes result in qualitative ones. In short, the first step is to read MapReduce Design Patterns from cover to cover!

One of the great things about the book is that it covers lower-level patterns such as Numerical Summarization and Counting with Counters but also higher level abstractions such as Metapatterns which are patterns comprised of other patterns such as Job Chaining and Job Merging. You can’t ask for more – there are chapters on Bloom Filtering, Binning, Replicated Joins, etc. Ultimately this would lead us to top-level architecture patterns such as Lambda Architecture – however, this is  the topic of a future blog. The final chapter touches upon future directions regarding non-textual formats such as image audio, video.

My particular favorite  is Chapter 7 Input and Output Patterns as I have recently been trying to hammer out an external table feature for a SQL-on-Hadoop database à la Hive’s StorageHandler. This chapter provides great clarity and insight on the topic. The authors start off by describing the basic underpinnings of key Hadoop interfaces such as InputFormat/RecordReader and the corresponding OutputFormat/RecordWriter from a high-level conceptual perspective. After reading this chapter, you understand the why as well as the how  of  these key interfaces. The chapter then goes beyond the common file-based formats to explore the more adventurous topic of connecting input and output sources to systems outside Hadoop and HDFS. The chosen example is the rocking NoSQL Redis database. This is a must read for anyone trying to connect Hadoop to external systems.

The one issue I had with the book is the “MapReduce” word in its title. Most of the patterns are not in any way specific to MapReduce. Of course, when the book was written MapReduce was the only game in town, but we now have a new generation of execution engines such as Spark and Tez. For the most part, the patterns are equally applicable to the other execution engines or even to a roll-your-own solution. I would love to see a companion book which describes the patterns with Spark code samples.



Adventures in Hive SerDe-landia

July 25, 2014


I was recently involved  in a product requirements effort for external tables for an SQL-on-Hadoop product. The first thing that came to mind was to take a deep dive into Hive’s SerDe and StorageHandler capabilities. The top-level StorageHandler interface in essence represents a good example of a Hadoop design pattern. These features allow you to connect any external data source to Hive – be it a file or network connection.  One of my core beliefs is that before starting any task, it is important do due diligence and see what other folks are doing in the same space. Prior knowledge and awareness of the current state of the art are things all developers and product managers should be required to do.


There are two dimensions to external data source connectivity:

  • The syntax for defining external data sources in Hive QL which is in essence a DSL (domain-specific language).
  • External data sources connector implementations

In my view, the DSL is not consistent and confusing. There are several ways to do the same thing, and some storage handlers are privileged and exposed as keywords in the DSL. For example, attributes of the default text storage handler are treated as first-order keywords instead of SERDEPROPERTIES. This blog will focus on the storage handler implementations.

There are two dimensions to the way Hive stores table data:

  • file format –  How the input source (e.g. file) is broken up into records.
  • row format – How a record is parsed and mapped into database columns. The SerDe interface defines this extension point.

HiveStorageHandler is the top-level interface that exposes the file and row format handlers. It has three methods – two defining the file format’s input and output, and one method defining the SerDe. A SerDe is simply an an aggregation of a Deserializer and Serializer. The nice thing about Hive is that eats its own dog food – the default text-based handler is implemented as a SerDe itself – MetadataTypedColumnsetSerDe.

One API design note. The the file and row format methods of HiveStorageHandler should be treated in a uniform fashion. Currently the two Input/Output file format methods are exposed as methods whereas the row feature is encapsulated in one method that returns SerDe.


MongoDB Hadoop Connector

The MongoDB folks have written a very comprehensive Hadoop connector:

Connect to MongoDB BSON File

This example illustrates how you can connect to BSON dump files from a MongoDB database. One potential problem I see here is mapping large schemas or managing schema changes. Specifying a large list in the DDL can get unwieldy. Having an external configuration file with the mapping might be easier to handle.

CREATE external TABLE person (
  name string,
  yob int,
  status boolean
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
    'mongo.columns.mapping'='{"name" : "name", "yob" : "yob", "status" : "status" ')
  INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
  OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
LOCATION '/data/person';

Connect to MongoDB database

This is equivalent to the BSON example except that we connect to a live database. As you add more documents to your MongoDB database they automatically appear in the Hive table! Note here my earlier complaint about having multiple ways in doing the same thing. The BSON example uses the STORED AS, INPUTFORMAT and OUTPUTFORMAT keywords whereas this example uses STORED BY. This distinction is not very intuitive.

CREATE external TABLE person (
  name string,
  yob int,
  age INT,
  status boolean
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
  'mongo.columns.mapping'='{"name" : "name", "yob" : "yob", "status" : "status" ')

Plugin Ecosystem

Having a well defined interface and productive abstractions has allowed a vigorous third-party plug-in ecosystem to flourish. Besides the text handler, Hive comes bundled with the following storage handlers:

  • Avro (Hive 0.9.1 and later)
  • ORC (Hive 0.11 and later)
  • RegEx
  • Thrift
  • Parquet (Hive 0.13 and later)

Some third-party implementations:

Hadoop Ecosystem Happenings

November 10, 2011

These are exciting times for the Hadoop ecosystem. Recently there has been a flurry of activities that can be broadly categorized into the following groups:

  • Hadoop services – Cloudera and Hortonworks
  • Hadoop internals – MapR and Hadapt
  • Big vendor adoption – Oracle, IBM, Microsoft, EMC

Hadoop Services Sparks

The launching of Hortonworks last summer has certainly given the established Cloudera a run for its money. If I had to summarize the situation in a nutshell it would be: Cloudera with an established customer base + Doug Cutting vs. Hortonworks with the Yahoo Hadoop development team and lots of promise/potential. Quite an interesting match here – sort of like a Hadoop version of  Thrilla in Manila (RIP Joe Frazier). Hortonworks has finally released its distribution Hortonworks Data Platform.

Some good reads:

And then there is the animated discussion on who contributes more to the Hadoop source repo – e.g.  number of patches vs. lines of code! Very entertaining stuff .

Hadoop Internals Improvement

Despite its runaway success, Hadoop does have technical shortcomings regarding performance and dependability (e.g. NameNode as single point of failure).  For example, see a good article regarding  JobTracker Next Generation of Apache Hadoop MapReduce – The Scheduler. These problems with Hadoop has provided an opportunity for  several new companies (MapR and Hadapt) to deliver proprietary enhancements to address some of the weaknesses.

The widespread adoption of Hive has significantly expanded the end user base of Hadoop, but at the same time it has put a strain on many Hadoop installations. The very expressiveness of Hive has empowered business folks to rather easily tap into the data store. These days the query barrier of entry is low. Accordingly, load has correspondingly significantly increased. Anecdotally, I have personally witnessed individual queries that can result up to a hundred map-reduce jobs! Shops too often blindly throw more hardware at the problem without first performing root analysis of performance issues. There is definitely a business opening here for more performant solutions. As Hadoop becomes more established in the enterprise, higher quality – faster and more reliable – features are in demand.


Earlier this year MapR started shipping two versions of its Hadoop distribution – the free M3 and more advanced M5 for which it charges. A major emphasis is on fixing Hadoop’s single points of failure.

Key features:

  • Distributed highly availabile NameNode
  • NFS support – can mount clusters as NFS volumes
  • Heatmap management console
  • Unlimited number of files
  • Point-in-time snapshots
  • User, job provisioning, quotas, LDAP integration (now that’s classy!)
  • 100% compatibility with Hadoop


Hadapt is a new startup  based on some very interesting work at Yale in the area of advanced database technology (parallel databases, column stores). Early access to its flagship product, Hadapt Adaptive Analytic Platform, was just announced the other day. See  the good article at dbms2 on the latest Hadapt news:  Hadapt happenings Hadapt is moving forward.

Key features:

  • Integrates Hadoop with a database storage engine
  • Universal SQL support – instead of the more general “standard” Hive query language, Hadapt provides a more tightly integrated SQL interface with the database engine (e.g. Postgres).
  • Multi-structured analytics – analysis of both structure and unstructured data
  • Adaptive query execution technology provides on-the-fly load balancing and fault tolerance.
  • An emphasis on cloud and virtualized load balancing – resonates with me.

This is a good example of new hybrid solution emerging in the polyglot persistence space. One should not think of  NoSQL and RDBMS as mutually exclusive propositions. Integrating current analytical tools to new NoSQL sources of data is a promising devleopment.

The use of a RDBMS as the underlying data store strikes me as familiar. On one of my recent NoSQL projects for a major telecom, we decided to replace Voldemort’s default Berkeley DB storage engine with MySQL since the latter (purportedly) benchmarked as faster. No SQL or transactions involved – just retrieval speed for a key/value data model.

Interestingly, even MySQL is now offering a “NoSQL” solution – direct access to the NDB C++ engine via a Memcached interface!

For more information on the theoretical and technical underpinnings of Hadapt, see the background papers at Daniel Abadi’s publications page. Also recommended is his thought-provoking – readable but rigorous – dbmusings blog. There are spot-on discussions on eventual consistency and  CAP theorem design trade-offs as well as other great articles.

One interesting TODO project would be to perform a more rigorous and complete comparison of the MapR and Hadapt products. Firstly you would do a feature set gap analysis. How are the same common problems addressed? Then you would look at the unique value adds that each vendor provides. A more complete analysis would entail running non-trivial comparison performance tests but of course this would require a major investment in hardware and time resources. You could perhaps start with some kind of TeraSort benchmark comparison of the two.

Piccolo – Dov’è?

One a side note of interest, last winter there was a big splash about the Piccolo project which promised to surpass Hadoop performance. Rather mysteriously I haven’t seen any activity or news about them since then. Getting an article in the New York Times is quite a significant achievement  – I wonder what has happened.