Adventures in Hive SerDe-landia

Overview

I was recently involved  in a product requirements effort for external tables for an SQL-on-Hadoop product. The first thing that came to mind was to take a deep dive into Hive’s SerDe and StorageHandler capabilities. The top-level StorageHandler interface in essence represents a good example of a Hadoop design pattern. These features allow you to connect any external data source to Hive – be it a file or network connection.  One of my core beliefs is that before starting any task, it is important do due diligence and see what other folks are doing in the same space. Prior knowledge and awareness of the current state of the art are things all developers and product managers should be required to do.

Basic

There are two dimensions to external data source connectivity:

  • The syntax for defining external data sources in Hive QL which is in essence a DSL (domain-specific language).
  • External data sources connector implementations

In my view, the DSL is not consistent and confusing. There are several ways to do the same thing, and some storage handlers are privileged and exposed as keywords in the DSL. For example, attributes of the default text storage handler are treated as first-order keywords instead of SERDEPROPERTIES. This blog will focus on the storage handler implementations.

There are two dimensions to the way Hive stores table data:

  • file format –  How the input source (e.g. file) is broken up into records.
  • row format – How a record is parsed and mapped into database columns. The SerDe interface defines this extension point.

HiveStorageHandler is the top-level interface that exposes the file and row format handlers. It has three methods – two defining the file format’s input and output, and one method defining the SerDe. A SerDe is simply an an aggregation of a Deserializer and Serializer. The nice thing about Hive is that eats its own dog food – the default text-based handler is implemented as a SerDe itself – MetadataTypedColumnsetSerDe.

One API design note. The the file and row format methods of HiveStorageHandler should be treated in a uniform fashion. Currently the two Input/Output file format methods are exposed as methods whereas the row feature is encapsulated in one method that returns SerDe.

HiveStorageHandler

MongoDB Hadoop Connector

The MongoDB folks have written a very comprehensive Hadoop connector:

Connect to MongoDB BSON File

This example illustrates how you can connect to BSON dump files from a MongoDB database. One potential problem I see here is mapping large schemas or managing schema changes. Specifying a large list in the DDL can get unwieldy. Having an external configuration file with the mapping might be easier to handle.

CREATE external TABLE person (
  name string,
  yob int,
  status boolean
)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
  WITH SERDEPROPERTIES(
    'mongo.columns.mapping'='{"name" : "name", "yob" : "yob", "status" : "status" ')
STORED AS
  INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
  OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
LOCATION '/data/person';

Connect to MongoDB database

This is equivalent to the BSON example except that we connect to a live database. As you add more documents to your MongoDB database they automatically appear in the Hive table! Note here my earlier complaint about having multiple ways in doing the same thing. The BSON example uses the STORED AS, INPUTFORMAT and OUTPUTFORMAT keywords whereas this example uses STORED BY. This distinction is not very intuitive.

CREATE external TABLE person (
  name string,
  yob int,
  age INT,
  status boolean
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES(
  'mongo.columns.mapping'='{"name" : "name", "yob" : "yob", "status" : "status" ')
TBLPROPERTIES(
  'mongo.uri'='mongodb://localhost:27017/test.persons');

Plugin Ecosystem

Having a well defined interface and productive abstractions has allowed a vigorous third-party plug-in ecosystem to flourish. Besides the text handler, Hive comes bundled with the following storage handlers:

  • Avro (Hive 0.9.1 and later)
  • ORC (Hive 0.11 and later)
  • RegEx
  • Thrift
  • Parquet (Hive 0.13 and later)

Some third-party implementations:

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


%d bloggers like this: