Archive for July, 2014

MapReduce Design Patterns Book Review

July 28, 2014

I recently came across the delightful book MapReduce Design Patterns by Donald Miner and Adam Shook. The book is an indispensable addition to the collection of any self-respecting big data professional. It is on par with another favorite of mine, the RESTful Web Services Cookbook. Both books are perfect examples of the right mix of theory and practice. They deal with difficult big data problems that developers and architects encounter, and provide focused, pragmatic and conceptually sound solutions. Kudos to O’Reilly for providing us with such pearls of wisdom.

Several projects that I’ve been involved with could have sorely used this book. Hungry developers eager to adopt hot new Hadoop technologies dazzle business folks with their preliminary efforts, and the initial results of monetizing large amounts of passive data make everyone happy.

Here’s one case I witnessed. In the first stage, ad server logs from a dozen machines are pushed to an HDFS cluster, MapReduce boils them down to CSV import files, and these are then loaded into a MySQL “data warehouse”. As ad requests rapidly increase, the number of ad server machines grows to 20 or 30, and the initial architecture starts buckling as the sheer complexity of many moving parts overwhelms the team. Without a proper grounding in big data concepts and frameworks, things fall apart in the face of higher volume and velocity. This is a good example of quantitative changes resulting in qualitative ones. In short, the first step is to read MapReduce Design Patterns from cover to cover!

One of the great things about the book is that it covers not only lower-level patterns such as Numerical Summarization and Counting with Counters, but also higher-level abstractions such as Metapatterns – patterns composed of other patterns, such as Job Chaining and Job Merging. You can’t ask for more – there are chapters on Bloom Filtering, Binning, Replicated Joins, etc. Ultimately this leads us to top-level architecture patterns such as the Lambda Architecture – but that is the topic of a future blog. The final chapter touches upon future directions regarding non-textual formats such as image, audio and video.
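
As a taste of the lower-level material: the essence of the Bloom Filtering pattern is to test each record map-side against a compact, probabilistic membership structure so that most irrelevant records never reach the reducers. Hadoop ships a Bloom filter in org.apache.hadoop.util.bloom; the snippet below is my own minimal sketch of that core idea (the hot list, sizing parameters and record format are made up), not code from the book.

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomFilteringSketch {
  public static void main(String[] args) {
    // Sizing is illustrative: the vector size and hash count control the false-positive rate.
    BloomFilter hotUsers = new BloomFilter(10000, 5, Hash.MURMUR_HASH);

    // Train the filter on the "hot list" (a real job would build this in a setup step
    // and ship the serialized filter to the mappers).
    for (String userId : new String[] {"u42", "u99", "u123"}) {
      hotUsers.add(new Key(userId.getBytes()));
    }

    // Map-side filtering: only records whose key might be in the hot list survive.
    for (String record : new String[] {"u42,click", "u7,view", "u99,click"}) {
      String userId = record.split(",")[0];
      if (hotUsers.membershipTest(new Key(userId.getBytes()))) {
        System.out.println("keep: " + record);  // possible member (may be a false positive)
      }
    }
  }
}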

My particular favorite is Chapter 7, Input and Output Patterns, as I have recently been trying to hammer out an external table feature for a SQL-on-Hadoop database à la Hive’s StorageHandler. This chapter provides great clarity and insight on the topic. The authors start off by describing the basic underpinnings of key Hadoop interfaces such as InputFormat/RecordReader and the corresponding OutputFormat/RecordWriter from a high-level conceptual perspective. After reading this chapter, you understand the why as well as the how of these key interfaces. The chapter then goes beyond the common file-based formats to explore the more adventurous topic of connecting input and output sources to systems outside Hadoop and HDFS. The chosen example is the rocking NoSQL Redis database. This is a must-read for anyone trying to connect Hadoop to external systems.
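
If you have never cracked these interfaces open, the skeleton below shows the shape the chapter describes: an InputFormat that carves the source into splits, and a RecordReader that turns a split into key/value records. It is my own illustrative sketch – the class names are hypothetical and a hard-coded in-memory map stands in for an external store such as Redis – not the book’s code.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class ExternalSourceInputFormat extends InputFormat<Text, Text> {

  // A real connector would return one split per shard/host of the external system.
  @Override
  public List<InputSplit> getSplits(JobContext context) {
    return Collections.singletonList(new TrivialSplit());
  }

  @Override
  public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
    return new MapRecordReader();
  }

  // Splits must be Writable so the framework can ship them to task nodes.
  public static class TrivialSplit extends InputSplit implements Writable {
    @Override public long getLength() { return 0; }
    @Override public String[] getLocations() { return new String[0]; }
    @Override public void write(DataOutput out) throws IOException { }
    @Override public void readFields(DataInput in) throws IOException { }
  }

  // The RecordReader walks the (stand-in) external data, one record per nextKeyValue() call.
  public static class MapRecordReader extends RecordReader<Text, Text> {
    private Iterator<Map.Entry<String, String>> it;
    private final Text key = new Text();
    private final Text value = new Text();
    private int total;
    private int read;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      // A real reader would connect here to the system named in the split (e.g. a Redis host).
      Map<String, String> data = new LinkedHashMap<String, String>();
      data.put("user:1", "alice");
      data.put("user:2", "bob");
      total = data.size();
      it = data.entrySet().iterator();
    }

    @Override
    public boolean nextKeyValue() {
      if (!it.hasNext()) {
        return false;
      }
      Map.Entry<String, String> entry = it.next();
      key.set(entry.getKey());
      value.set(entry.getValue());
      read++;
      return true;
    }

    @Override public Text getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return total == 0 ? 1.0f : (float) read / total; }
    @Override public void close() { }
  }
}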

The one issue I had with the book is the word “MapReduce” in its title. Most of the patterns are not in any way specific to MapReduce. Of course, when the book was written, MapReduce was the only game in town, but we now have a new generation of execution engines such as Spark and Tez. For the most part, the patterns are equally applicable to the other execution engines or even to a roll-your-own solution. I would love to see a companion book that describes the patterns with Spark code samples.
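
As a small illustration of that point, here is roughly what the book’s Numerical Summarization pattern (a simple per-key count) looks like in Spark’s Java API. This is my own sketch with placeholder paths, not a sample from the book or any companion to it.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class NumericalSummarizationSpark {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("numerical-summarization");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Placeholder input: one CSV record per line, with the grouping key in the first field.
    JavaRDD<String> lines = sc.textFile("hdfs:///data/adserver/logs");

    // Same shape as the MapReduce pattern: emit (key, 1) and sum per key,
    // without writing a Mapper, a Reducer or a Job driver.
    JavaPairRDD<String, Long> countsPerKey = lines
        .mapToPair(line -> new Tuple2<>(line.split(",")[0], 1L))
        .reduceByKey(Long::sum);

    countsPerKey.saveAsTextFile("hdfs:///data/adserver/summary");  // placeholder output path
    sc.stop();
  }
}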


Adventures in Hive SerDe-landia

July 25, 2014

Overview

I was recently involved in a product requirements effort for external tables for a SQL-on-Hadoop product. The first thing that came to mind was to take a deep dive into Hive’s SerDe and StorageHandler capabilities. The top-level StorageHandler interface is, in essence, a good example of a Hadoop design pattern. These features allow you to connect any external data source to Hive – be it a file or a network connection. One of my core beliefs is that before starting any task, it is important to do due diligence and see what other folks are doing in the same space. Acquiring prior knowledge and an awareness of the current state of the art is something all developers and product managers should be required to do.

Basic

There are two dimensions to external data source connectivity:

  • The syntax for defining external data sources in Hive QL, which is in essence a DSL (domain-specific language).
  • External data source connector implementations.

In my view, the DSL is inconsistent and confusing. There are several ways to do the same thing, and some storage handlers are privileged and exposed as keywords in the DSL. For example, attributes of the default text storage handler are treated as first-order keywords instead of SERDEPROPERTIES. This blog will focus on the storage handler implementations.

There are two dimensions to the way Hive stores table data:

  • file format – how the input source (e.g. a file) is broken up into records.
  • row format – how a record is parsed and mapped into database columns. The SerDe interface defines this extension point.

HiveStorageHandler is the top-level interface that exposes the file and row format handlers. It exposes three key methods – two defining the file format’s input and output, and one defining the SerDe. A SerDe is simply an aggregation of a Deserializer and a Serializer. The nice thing about Hive is that it eats its own dog food – the default text-based handler is implemented as a SerDe itself, MetadataTypedColumnsetSerDe.

One API design note: the file and row format methods of HiveStorageHandler should be treated in a uniform fashion. Currently the input and output file formats are exposed as two separate methods, whereas the row format is encapsulated in a single method that returns a SerDe.

HiveStorageHandler
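
For reference, here is an abbreviated sketch of the interface, written from memory of the Hive 0.13-era API – the real interface also exposes a metastore hook, an authorization provider and job-property configuration callbacks, so consult the Hive source for the authoritative definition.

// Abbreviated sketch of org.apache.hadoop.hive.ql.metadata.HiveStorageHandler
// (only the methods discussed above are shown).
public interface HiveStorageHandler extends org.apache.hadoop.conf.Configurable {

  // File format: how table data is split into records on the way in and out.
  Class<? extends org.apache.hadoop.mapred.InputFormat> getInputFormatClass();
  Class<? extends org.apache.hadoop.mapred.OutputFormat> getOutputFormatClass();

  // Row format: the SerDe that maps each record to and from table columns.
  Class<? extends org.apache.hadoop.hive.serde2.SerDe> getSerDeClass();
}

// A SerDe itself is just the union of the two row-format halves:
// public interface SerDe extends Deserializer, Serializer { }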

MongoDB Hadoop Connector

The MongoDB folks have written a very comprehensive Hadoop connector:

Connect to MongoDB BSON File

This example illustrates how you can connect to BSON dump files from a MongoDB database. One potential problem I see here is mapping large schemas or managing schema changes. Specifying a large list in the DDL can get unwieldy. Having an external configuration file with the mapping might be easier to handle.

CREATE external TABLE person (
  name string,
  yob int,
  status boolean
)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
  WITH SERDEPROPERTIES(
    'mongo.columns.mapping'='{"name" : "name", "yob" : "yob", "status" : "status"}')
STORED AS
  INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
  OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
LOCATION '/data/person';

Connect to MongoDB database

This is equivalent to the BSON example except that we connect to a live database. As you add more documents to your MongoDB database, they automatically appear in the Hive table! Note here my earlier complaint about having multiple ways of doing the same thing: the BSON example uses the STORED AS, INPUTFORMAT and OUTPUTFORMAT keywords, whereas this example uses STORED BY. The distinction is not very intuitive.

CREATE external TABLE person (
  name string,
  yob int,
  age int,
  status boolean
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES(
  'mongo.columns.mapping'='{"name" : "name", "yob" : "yob", "status" : "status"}')
TBLPROPERTIES(
  'mongo.uri'='mongodb://localhost:27017/test.persons');

Plugin Ecosystem

Having a well-defined interface and productive abstractions has allowed a vigorous third-party plug-in ecosystem to flourish. Besides the text handler, Hive comes bundled with the following storage handlers:

  • Avro (Hive 0.9.1 and later)
  • ORC (Hive 0.11 and later)
  • RegEx
  • Thrift
  • Parquet (Hive 0.13 and later)

Some third-party implementations: