Twitter User Similarity and Collective Intelligence

This is the second part of a blog regarding a recent mini-project where I implemented a Twitter user similarity service.  The first part described my experience with the mechanics of the Twitter REST API – this part focuses more on the “collective intelligence” aspects.

The requirements were simple: define a concept of “similarity for two users” and implement a Twitter solution for it.

Resources used:

Being a bit rusty on basic CI concepts (to my chagrin, but in my defense there is so much computer stuff to know out there), I did a quick search for high-quality links on the topic, drilled down a bit and read the high-value articles. I downloaded all the free PDF chapters of the books and read the relevant sections. I went to my local Borders bookstore, which was fortunately stocked with all the above books. I had already purchased Segaran’s book, so I used chapter 2, “Making Recommendations”, which discusses the Euclidean distance and Pearson correlation formulas, and these seemed to fit the bill. AIW also had an even more detailed discussion on the subject – too much to implement in the short time frame, but definitely a candidate for version two. I reviewed my statistics books, and lo and behold, it turned out these were not exotic algorithms but rather standard statistical data comparison techniques. To paraphrase an old sailing jingle: so much knowledge, so little time (so many boats, so little time).

I settled upon defining similarity by comparing word counts between two users over a set of Twitter status messages from a given timeline. As usual I leveraged Spring for effortless configuration and bean wiring (thanks again Rod!). The basic logic was to issue calls to the Twitter API “method” user_timeline for each user. This returned a list of tweets for each user, which I would iterate over, concatenating the Status text elements. I then computed a map of words and a count of all their occurrences. This map was then fed to the similarity scorer, which would return a value between 0 and 1.

Last but not least was the WordCounter class. This object accepts raw text and returns a map of words and their counts. Of special interest is the lexical analyzer. For the first pass I used a simple String.split() and a list of stop words. But minimal analysis revealed a submerged world of complexity involving punctuation, stemming, etc. Whew! Ideally it too should be an interface.
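To make the first-pass approach concrete, here is a minimal sketch of a word counter built on String.split() and a stop-word list. It is not the actual WordCounter from the project: the class name SimpleWordCounter, the method countWords, the regular expression and the tiny stop-word list are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and the stop-word list are assumptions, not the project's code.
final class SimpleWordCounter {

    // A tiny illustrative stop-word list; a realistic one would be much longer.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "to", "in", "is", "it");

    // Lowercases the text, splits on runs of non-letter characters,
    // drops stop words, and counts what is left.
    static Map<String, Integer> countWords(String rawText) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : rawText.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            // split() can yield a leading empty token when the text starts with punctuation.
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}
```

Even this small version shows where the complexity hides: splitting on non-letters breaks “don’t” into two tokens, strips the prefix from @mentions and #hashtags, and treats “run” and “running” as unrelated words.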

Here’s a UML class diagram of the overall system:

The entry point is a service which returns a double value between 0 and 1 indicating user similarity.

    public interface SimilarityService {
        public double getSimilarityScore(String user1, String user2);
    }

This service interface has four implementations. Two are real providers (Twitter4j and JTwitter) that issue actual calls to the Twitter API for two user timelines. The mock implementation operates on files containing the concatenated raw text. As an inspirational freebie, I threw in the RssSimilarity provider, which performs the similarity scoring on RSS feeds. It’s quite cool how much can be done so easily and quickly when you’ve got the right abstractions and layering in place. Nothing excessively fancy here except solid engineering practices, all wrapped in rocking Spring. The other extension point is the similarity scorer, which computes a similarity score for two word count maps.

    public interface SimilarityScorer {
         public double calculateSimilarityScore(Map<String, Integer> wordCount1,
              Map<String, Integer> wordCount2);
    }

The two provided implementations are:

  • Euclidean Distance
  • Pearson Correlation
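For readers who want to see the arithmetic, the following is a rough sketch of how the two scorers could work on word count maps. It is not the project's code. The class and method names are hypothetical, and three choices are my assumptions: the comparison runs over the union of both vocabularies (a missing word counts as 0), the Euclidean distance d is turned into a score as 1 / (1 + d), and a negative Pearson coefficient is clamped to 0 so that both results stay between 0 and 1.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the two scorers; names and conventions are assumptions.
final class ScorerSketch {

    // Union of the words used by either user; a word absent from one map counts as 0.
    private static Set<String> vocabulary(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> words = new HashSet<>(a.keySet());
        words.addAll(b.keySet());
        return words;
    }

    // Euclidean distance d over the count vectors, mapped to (0, 1] as 1 / (1 + d).
    static double euclideanSimilarity(Map<String, Integer> a, Map<String, Integer> b) {
        double sumOfSquares = 0.0;
        for (String w : vocabulary(a, b)) {
            double diff = a.getOrDefault(w, 0) - b.getOrDefault(w, 0);
            sumOfSquares += diff * diff;
        }
        return 1.0 / (1.0 + Math.sqrt(sumOfSquares));
    }

    // Pearson correlation coefficient r of the two count vectors.
    // r lies in [-1, 1]; negative values are clamped to 0 (an assumption).
    static double pearsonSimilarity(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> words = vocabulary(a, b);
        int n = words.size();
        if (n == 0) {
            return 0.0;
        }
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (String w : words) {
            double x = a.getOrDefault(w, 0);
            double y = b.getOrDefault(w, 0);
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        double denominator = Math.sqrt(varianceA * varianceB);
        if (denominator == 0.0) {
            return 0.0; // r is undefined when either vector is constant
        }
        return Math.max(0.0, covariance / denominator);
    }
}
```

One practical difference between the two: the Euclidean score is sensitive to how much each user tweets, since a larger volume of text inflates the raw counts, while Pearson correlation looks at whether the counts rise and fall together and so is less affected by volume.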

Other possible candidate solutions to be investigated are:

  • Manhattan (taxicab) distance
  • Jaccard distance
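As a starting point for that investigation, here is a sketch of what these two candidates might look like. None of this exists in the project; the class and method names are hypothetical. To fit the 0-to-1 contract I assume the Jaccard measure is reported as a similarity (shared words divided by all words, which is 1 minus the Jaccard distance) and that the Manhattan distance d is mapped to 1 / (1 + d).

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the candidate scorers; names and conventions are assumptions.
final class CandidateScorers {

    // Jaccard similarity on the word sets: |A ∩ B| / |A ∪ B|. Counts are ignored.
    static double jaccardSimilarity(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> union = new HashSet<>(a.keySet());
        union.addAll(b.keySet());
        if (union.isEmpty()) {
            return 0.0; // two empty vocabularies: treated as "no evidence" here
        }
        Set<String> intersection = new HashSet<>(a.keySet());
        intersection.retainAll(b.keySet());
        return (double) intersection.size() / union.size();
    }

    // Manhattan (taxicab) distance d: the sum of absolute count differences,
    // mapped to (0, 1] as 1 / (1 + d).
    static double manhattanSimilarity(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> union = new HashSet<>(a.keySet());
        union.addAll(b.keySet());
        double distance = 0.0;
        for (String w : union) {
            distance += Math.abs(a.getOrDefault(w, 0) - b.getOrDefault(w, 0));
        }
        return 1.0 / (1.0 + distance);
    }
}
```

Jaccard only asks whether a word was used at all, which makes it robust to differences in tweet volume but blind to emphasis; Manhattan keeps the counts and is simply a cheaper cousin of the Euclidean scorer.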

Overall, this was one of the more intellectually challenging projects in a while. On the “interest” scale it certainly compares with the NoSQL and eventual consistency stuff I’ve been recently doing. I certainly aim to pursue this topic more – hopefully in a remunerated capacity!
