
Hadoop Ecosystem Happenings

November 10, 2011

These are exciting times for the Hadoop ecosystem. Recently there has been a flurry of activity that can be broadly categorized into the following groups:

  • Hadoop services – Cloudera and Hortonworks
  • Hadoop internals – MapR and Hadapt
  • Big vendor adoption – Oracle, IBM, Microsoft, EMC

Hadoop Services Sparks

The launch of Hortonworks last summer has certainly given the established Cloudera a run for its money. If I had to summarize the situation in a nutshell, it would be: Cloudera with an established customer base plus Doug Cutting vs. Hortonworks with the Yahoo Hadoop development team and lots of promise/potential. Quite an interesting match here – sort of like a Hadoop version of the Thrilla in Manila (RIP Joe Frazier). Hortonworks has finally released its distribution, the Hortonworks Data Platform.

Some good reads:

And then there is the animated discussion on who contributes more to the Hadoop source repo – e.g. number of patches vs. lines of code! Very entertaining stuff.

Hadoop Internals Improvement

Despite its runaway success, Hadoop does have technical shortcomings regarding performance and dependability (e.g. the NameNode as a single point of failure). For example, see the good article regarding the JobTracker, Next Generation of Apache Hadoop MapReduce – The Scheduler. These problems with Hadoop have provided an opportunity for several new companies (MapR and Hadapt) to deliver proprietary enhancements that address some of the weaknesses.

The widespread adoption of Hive has significantly expanded Hadoop’s end user base, but at the same time it has put a strain on many Hadoop installations. The very expressiveness of Hive has empowered business folks to rather easily tap into the data store. These days the query barrier to entry is low, and load has increased correspondingly. Anecdotally, I have personally witnessed individual queries that result in up to a hundred map-reduce jobs! Shops too often blindly throw more hardware at the problem without first performing root-cause analysis of the performance issues. There is definitely a business opening here for more performant solutions. As Hadoop becomes more established in the enterprise, higher quality – faster and more reliable – features are in demand.

MapR

Earlier this year MapR started shipping two versions of its Hadoop distribution – the free M3 and the more advanced M5, for which it charges. A major emphasis is on fixing Hadoop’s single points of failure.

Key features:

  • Distributed, highly available NameNode
  • NFS support – can mount clusters as NFS volumes
  • Heatmap management console
  • Unlimited number of files
  • Point-in-time snapshots
  • User and job provisioning, quotas, LDAP integration (now that’s classy!)
  • 100% compatibility with Hadoop

Hadapt

Hadapt is a new startup based on some very interesting work at Yale in the area of advanced database technology (parallel databases, column stores). Early access to its flagship product, the Hadapt Adaptive Analytic Platform, was just announced the other day. See the good articles at DBMS2 on the latest Hadapt news: Hadapt happenings and Hadapt is moving forward.

Key features:

  • Integrates Hadoop with a database storage engine
  • Universal SQL support – instead of the not-quite-standard Hive query language, Hadapt provides a SQL interface that is tightly integrated with the underlying database engine (e.g. Postgres).
  • Multi-structured analytics – analysis of both structured and unstructured data
  • Adaptive query execution technology provides on-the-fly load balancing and fault tolerance.
  • An emphasis on cloud and virtualized load balancing – resonates with me.

This is a good example of a new hybrid solution emerging in the polyglot persistence space. One should not think of NoSQL and RDBMS as mutually exclusive propositions. Integrating current analytical tools with new NoSQL data sources is a promising development.

The use of an RDBMS as the underlying data store strikes me as familiar. On one of my recent NoSQL projects for a major telecom, we decided to replace Voldemort’s default Berkeley DB storage engine with MySQL since the latter (purportedly) benchmarked as faster. No SQL or transactions involved – just retrieval speed for a key/value data model.

Interestingly, even MySQL is now offering a “NoSQL” solution – direct access to the NDB C++ engine via a Memcached interface!
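
To make that concrete, here is a minimal sketch of key/value access through the Memcached interface, assuming the open-source spymemcached client and a MySQL Cluster node with the memcached plugin listening on the default port (host, port, and key are illustrative):

  import java.net.InetSocketAddress;
  import net.spy.memcached.MemcachedClient;

  public class NdbMemcachedDemo {
    public static void main(String[] args) throws Exception {
      // Standard memcached protocol; the server maps set/get onto the NDB engine.
      MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));
      client.set("userProfile:42", 0, "{\"name\":\"joe\"}").get(); // block until the write completes
      System.out.println(client.get("userProfile:42"));
      client.shutdown();
    }
  }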

For more information on the theoretical and technical underpinnings of Hadapt, see the background papers at Daniel Abadi’s publications page. Also recommended is his thought-provoking – readable but rigorous – dbmusings blog. There are spot-on discussions of eventual consistency and CAP theorem design trade-offs, as well as other great articles.

One interesting TODO project would be to perform a more rigorous and complete comparison of the MapR and Hadapt products. First you would do a feature-set gap analysis: how does each address the same common problems? Then you would look at the unique value-adds that each vendor provides. A more complete analysis would entail running non-trivial comparative performance tests, but of course this would require a major investment in hardware and time. You could perhaps start with some kind of TeraSort benchmark comparison of the two.

Piccolo – Where Is It?

On a side note, last winter there was a big splash about the Piccolo project, which promised to surpass Hadoop performance. Rather mysteriously, I haven’t seen any activity or news about it since then. Getting an article in the New York Times is quite a significant achievement – I wonder what has happened.


Spring Configuration – Selecting An Alternate Implementation

November 9, 2011

A common recurring pattern in software development is the need to select a specific instance of an interface at runtime. This instance can be either a distinct implementation class of the interface or the same class instantiated with different properties. Spring provides unparalleled abilities to define different bean instances. These can be categorized as follows:

  • Each bean is a different implementation class of the interface.
  • Each bean is the same implementation class but has different configuration.
  • A mixture of the two above.

The canonical example is selecting a mock implementation for testing instead of the actual target production implementation. However there are often business use cases where alternate providers need to be selectively activated.

The goal is to externalize the selection mechanism by providing a way to toggle the desired bean name. We want to avoid manually commenting and uncommenting bean names inside a Spring XML configuration file. In other words, the key question is: how do we toggle the particular implementation?

A brief disclaimer note: this pattern is most applicable to Spring 3.0.x and lower. Spring 3.1 introduces some exciting new features such as bean definition profiles for different environments. See the following articles for in-depth discussions:

There are two variants of this pattern:

  • Single Implementation – We only need one active implementation at runtime.
  • Multiple Implementations – We need several implementations at runtime so the application can dynamically select the desired one.

Assume we have the following interface:

  public interface NoSqlDao<T extends NoSqlEntity> {
     public void put(T o) throws Exception;
     public T get(String id) throws Exception;
     public void delete(String id) throws Exception;
  }

  public interface UserProfileDao extends NoSqlDao<UserProfile> {
  }
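
For reference, here is a minimal sketch of the entity types assumed by these interfaces (the fields are hypothetical):

  public interface NoSqlEntity {
    public String getId();
  }

  public class UserProfile implements NoSqlEntity {
    private String id;
    private String email;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getEmail() { return email; }
    public void setEmail(String email) { this.email = email; }
  }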

Assume two implementations of the interface:

  public class CassandraUserProfileDao<T extends UserProfile>
    implements UserProfileDao

  public class MongodbUserProfileDao<T extends UserProfile>
    implements UserProfileDao

Single Loaded Implementation

In this variant of the pattern, you only need one implementation at runtime. Let’s assume that the name of the bean we wish to load is userProfileDao.

  ApplicationContext context = new ClassPathXmlApplicationContext("applicationContext.xml");
  UserProfileDao userProfileDao = context.getBean("userProfileDao",UserProfileDao.class);

The top-level applicationContext.xml file contains common global beans and an import statement for the desired provider. The value of the imported file is externalized as a property called providerConfigFile. Since each provider file is mutually exclusive, the bean name is the same in each file.

  <beans>
    <bean id="propertyConfigurer"
          class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
      <property name="location" value="classpath:context-PropertyOverrideConfigurer.properties" />
      <property name="systemPropertiesModeName" value="SYSTEM_PROPERTIES_MODE_OVERRIDE" />
    </bean>
    <import resource="${providerConfigFile}"/>
  </beans>

The provider-specific configuration files are:

  applicationContext-cassandra.xml
  applicationContext-mongodb.xml
  applicationContext-redis.xml
  applicationContext-riak.xml
  applicationContext-membase.xml
  applicationContext-oracle.xml

For example (note the same bean name userProfileDao):

  applicationContext-cassandra.xml

    <bean id="userProfileDao" class="com.amm.nosql.dao.cassandra.CassandraUserProfileDao" >
      <constructor-arg ref="keyspace.userProfile"/>
      <constructor-arg value="${cassandra.columnFamily.userProfile}"/>
      <constructor-arg ref="userProfileObjectMapper" />
    </bean> 

  applicationContext-mongodb.xml

    <bean id="userProfileDao" class="com.amm.nosql.dao.mongodb.MongodbUserProfileDao">
      <constructor-arg ref="userProfile.collectionFactory" />
      <constructor-arg ref="mongoObjectMapper" />
    </bean>

At runtime you need to specify the value for the property providerConfigFile. Unfortunately, with Spring 3.0 this has to be a system property and cannot be specified inside a properties file! This means it will work for a stand-alone Java application but not for a WAR, unless you pass the value externally to the web server as a system property. This problem has allegedly been fixed in Spring 3.1 (I didn’t notice it working in 3.1.0.RC1). For example:

  java
    -DproviderConfigFile=applicationContext-cassandra.xml
    com.amm.nosql.cli.UserProfileCli
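
For stand-alone code and tests, one workaround is to set the system property programmatically before the context is created. A minimal sketch, following the same naming convention as above:

  // Placeholders in <import> are resolved from system properties,
  // so the property must be set before the context loads.
  System.setProperty("providerConfigFile", "applicationContext-cassandra.xml");
  ApplicationContext context = new ClassPathXmlApplicationContext("applicationContext.xml");
  UserProfileDao userProfileDao = context.getBean("userProfileDao", UserProfileDao.class);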

Multiple Loaded Implementations 

With this variant of the pattern, you need all implementations loaded into your application context so you can later decide which one to choose. Instead of one import statement, applicationContext.xml imports all implementations.

  <import resource="applicationContext-cassandra.xml" />
  <import resource="applicationContext-mongodb.xml" />
  <import resource="applicationContext-redis.xml" />
  <import resource="applicationContext-riak.xml" />
  <import resource="applicationContext-membase.xml" />
  <import resource="applicationContext-oracle.xml" />

Since you have one namespace, each implementation has to have a unique bean name for its UserProfileDao implementation. Using our previous example:

applicationContext-cassandra.xml

  <bean id="cassandra.userProfileDao" class="com.amm.nosql.dao.cassandra.CassandraUserProfileDao" >
    <constructor-arg ref="keyspace.userProfile"/>
    <constructor-arg value="${cassandra.columnFamily.userProfile}"/>
    <constructor-arg ref="userProfileObjectMapper" />
  </bean> 

applicationContext-mongodb.xml

  <bean id="mongodb.userProfileDao" class="com.amm.nosql.dao.mongodb.MongodbUserProfileDao">
    <constructor-arg ref="userProfile.collectionFactory" />
    <constructor-arg ref="mongoObjectMapper" />
  </bean>

Then inside your Java code you need a mechanism to select the desired bean, e.g. load either cassandra.userProfileDao or mongodb.userProfileDao. For example, you could have a test UI containing a dropdown list of all implementations. Or you might even have a case where you need to access two different NoSQL stores via the UserProfileDao interface.
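
A minimal sketch of such a selection mechanism – the provider prefix argument is illustrative:

  public class UserProfileDaoSelector {
    private final ApplicationContext context;

    public UserProfileDaoSelector(ApplicationContext context) {
      this.context = context;
    }

    // Looks up the DAO by provider prefix, e.g. "cassandra" or "mongodb".
    public UserProfileDao select(String provider) {
      return context.getBean(provider + ".userProfileDao", UserProfileDao.class);
    }
  }

Usage is then a one-liner:

  UserProfileDao dao = new UserProfileDaoSelector(context).select("mongodb");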