Overview
I was recently involved in a product-requirements effort for external tables in an SQL-on-Hadoop product. The first thing that came to mind was to take a deep dive into Hive’s SerDe and StorageHandler capabilities. The top-level StorageHandler interface is in essence a good example of a Hadoop design pattern: it lets you connect any external data source to Hive, be it a file or a network connection. One of my core beliefs is that before starting any task it is important to do due diligence and see what other folks are doing in the same space. Awareness of prior work and of the current state of the art is something every developer and product manager should cultivate.
Basics
There are two dimensions to external data source connectivity:
- The syntax for defining external data sources in Hive QL, which is in essence a DSL (domain-specific language).
- External data source connector implementations.
In my view, the DSL is inconsistent and confusing. There are several ways to do the same thing, and some storage handlers are privileged, being exposed as keywords in the DSL. For example, attributes of the default text storage handler are treated as first-order keywords instead of SERDEPROPERTIES. This blog will focus on the storage handler implementations.
There are two dimensions to the way Hive stores table data:
- file format – How the input source (e.g. file) is broken up into records.
- row format – How a record is parsed and mapped into database columns. The SerDe interface defines this extension point.
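To make the row-format dimension concrete, here is a minimal, self-contained sketch of what a SerDe does. This is not the real Hive API (org.apache.hadoop.hive.serde2); it is a simplified illustration of the Deserializer/Serializer roles: turning a raw delimited record into column values and back.

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration of the SerDe idea -- NOT the real Hive SerDe interface.
public class ToySerDe {
    // "Deserializer" half: parse one delimited record into column values.
    public static List<Object> deserialize(String record, char delimiter) {
        String[] fields = record.split(String.valueOf(delimiter), -1);
        return Arrays.asList((Object[]) fields);
    }

    // "Serializer" half: write column values back out as a delimited record.
    public static String serialize(List<Object> columns, char delimiter) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < columns.size(); i++) {
            if (i > 0) sb.append(delimiter);
            sb.append(columns.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Object> row = deserialize("alice,1990,true", ',');
        System.out.println(row);                 // prints [alice, 1990, true]
        System.out.println(serialize(row, ',')); // prints alice,1990,true
    }
}
```

A real SerDe additionally carries type information via an ObjectInspector, but the record-to-columns mapping above is the essence of the row-format extension point.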
HiveStorageHandler is the top-level interface that exposes the file and row format handlers. It has three methods – two defining the file format’s input and output, and one defining the SerDe. A SerDe is simply an aggregation of a Deserializer and a Serializer. The nice thing about Hive is that it eats its own dog food – the default text-based handler is implemented as a SerDe itself – MetadataTypedColumnsetSerDe.
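The shape of that three-method contract can be sketched as follows. This is a self-contained mock, not the real org.apache.hadoop.hive.ql.metadata.HiveStorageHandler interface (which returns Class objects and has additional hooks), but it captures the three concerns a handler must answer for:

```java
// Mock of the HiveStorageHandler pattern -- NOT the real Hive interface.
interface StorageHandlerShape {
    String inputFormatName();   // file format: how the source splits into records
    String outputFormatName();  // file format: how records are written back out
    String serDeName();         // row format: how a record maps to columns
}

// Hypothetical handler wired up the way Hive's default text handler is.
public class TextHandlerSketch implements StorageHandlerShape {
    public String inputFormatName()  { return "TextInputFormat"; }
    public String outputFormatName() { return "HiveIgnoreKeyTextOutputFormat"; }
    public String serDeName()        { return "MetadataTypedColumnsetSerDe"; }

    public static void main(String[] args) {
        StorageHandlerShape h = new TextHandlerSketch();
        System.out.println(h.inputFormatName() + " / " + h.serDeName());
    }
}
```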
One API design note: the file and row format methods of HiveStorageHandler should be treated in a uniform fashion. Currently the two input/output file format concerns are exposed as separate methods, whereas the row concern is encapsulated in a single method that returns a SerDe.
MongoDB Hadoop Connector
The MongoDB folks have written a very comprehensive Hadoop connector:
- https://github.com/mongodb/mongo-hadoop
- https://github.com/mongodb/mongo-hadoop/blob/master/hive/README.md
Connect to MongoDB BSON File
This example illustrates how you can connect to BSON dump files from a MongoDB database. One potential problem I see here is mapping large schemas or managing schema changes. Specifying a large list in the DDL can get unwieldy. Having an external configuration file with the mapping might be easier to handle.
CREATE EXTERNAL TABLE person (name STRING, yob INT, status BOOLEAN)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
WITH SERDEPROPERTIES (
  'mongo.columns.mapping' = '{"name" : "name", "yob" : "yob", "status" : "status"}'
)
STORED AS
  INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
  OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
LOCATION '/data/person';
Connect to MongoDB database
This is equivalent to the BSON example except that we connect to a live database. As you add more documents to your MongoDB database, they automatically appear in the Hive table! Note here my earlier complaint about having multiple ways of doing the same thing: the BSON example uses the STORED AS, INPUTFORMAT, and OUTPUTFORMAT keywords, whereas this example uses STORED BY. This distinction is not very intuitive.
CREATE EXTERNAL TABLE person (name STRING, yob INT, age INT, status BOOLEAN)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES (
  'mongo.columns.mapping' = '{"name" : "name", "yob" : "yob", "status" : "status"}'
)
TBLPROPERTIES (
  'mongo.uri' = 'mongodb://localhost:27017/test.persons'
);
Plugin Ecosystem
Having a well-defined interface and productive abstractions has allowed a vigorous third-party plug-in ecosystem to flourish. Besides the text handler, Hive comes bundled with additional storage handlers, such as the HBase storage handler.
Some third-party implementations:
- MongoDB Connector for Hadoop – https://github.com/mongodb/mongo-hadoop
- Independent MongoDB storage handler – https://github.com/yc-huang/Hive-mongo
- Hive Cassandra Handler – https://github.com/tuplejump/cash
- Cloudera Twitter – com.cloudera.hive.serde.JSONSerDe