Archive for the ‘AWS’ Category

Initial Python Impressions

August 12, 2010

This blog is about my initial impressions of Python in the context of an “official Python production project”. I have long used Unix scripts for basic scripting needs, and occasionally used Python (Perl less so) for more substantial tasks but it has always been “unofficial”. My latest gig involved deploying a Python program to listen to incoming AWS SQS messages and dispatch them to a downstream processing engine (business logic, MySQL database).
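For flavor, here is a minimal sketch of what such a listener loop might look like using the boto library (the queue name and dispatch_to_engine() are hypothetical stand-ins for the real business logic):

# Minimal sketch (Python 2 / boto): poll SQS, hand each message to a
# downstream processor, and delete it only after successful dispatch.
import time
import boto

def dispatch_to_engine(body):
    print "dispatching:", body  # stand-in for the business logic / MySQL layer

conn = boto.connect_sqs()           # credentials taken from the environment
queue = conn.get_queue("incoming")  # hypothetical queue name

while True:
    messages = queue.get_messages(num_messages=10, visibility_timeout=30)
    for msg in messages:
        dispatch_to_engine(msg.get_body())
        queue.delete_message(msg)   # delete only after successful dispatch
    if not messages:
        time.sleep(5)               # back off when the queue is empty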

Though Java has been my bread ‘n butter since its inception, I am firmly in the camp of language non-bigots. I was a coder long before Java, and it is hardly the only show in town. It basically boils down to the best tool for the task at hand. After all, it is all about tools – that’s what launched us Homo sapiens onto our current trajectory towards ultimate civilization.

Python is certainly enticing, and I fully appreciate its appeal. There’s no question, for example, that a Python dictionary is much more convenient to define than a Java map:

mydict = {}

instead of its Java equivalent of:

Map<String,Integer> map = new HashMap<String,Integer>();

or the upcoming Java 7 “diamond” syntax with inferred type arguments:

Map<String,Integer> map = new HashMap<>();

You can also use Google Guava’s collection factory methods, such as Maps.newHashMap(), to mitigate this issue for now.

It is obviously so much easier to “whip out” a Python program to execute some basic functionality than a Java equivalent. The crux of the dilemma is: convenience for developers vs. long-term operations concerns.

It basically boils down to two issues (not necessarily unrelated):

  • Type safety
  • Size of team

If you’re one developer or a tight group of like-minded developers, then type safety issues can be mitigated by convention and mind-meld. However, as soon as the team grows and the life cycle of the application is extended (the original developers are no longer involved in maintenance), the problems begin. It’s hard to imagine a dynamically typed language such as Python comparing to Java for a large-scale development effort where unrelated developers and hundreds of thousands of lines of code are involved.

For example, without explicit typing, new developers are forced to drill down into the source code to verify method signatures. In the Java world this is typically handled by Javadoc, IDE magic, or mere perusal of source signatures. In Python, you cannot simply look at a method’s signature, for it carries no type information – you have to read the entire method body and trace every return path (the higher the cyclomatic complexity, the more paths to trace).
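A contrived sketch of the problem:

# Sketch (Python 2): the signature says nothing about what 'msg' must
# support or what comes back; every return path must be read.
def dispatch(msg):
    if hasattr(msg, "body"):
        return msg.body      # whatever type the attribute happens to be
    if isinstance(msg, str):
        return {"raw": msg}  # ...a dict on this path...
    return None              # ...or None: three possible return types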

An interesting recent article looks at precisely these issues in the migration from Python to Java for Nuxeo’s CMS – see here.

In order to bullet-proof production code, the developer is forced to “play compiler”. To compensate for the lack of one, much of the type checking has to be done in unit tests – tests that would simply not exist in the Java world. These tests are accidental complexity – extra cost – and exist only for type safety. Here the chickens come home to roost: the trade-off between developer ease of use and run-time stability. Senior Python developers have told me that the “safe” way is to check function return values by either using isinstance() or checking for specific attributes with hasattr(). Whew! This just doesn’t “smell right” to me – too dependent on the whims of individuals. The stuff of nightmares for operations folks trying to discern what went wrong at 3 AM!
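A contrived sketch of what this defensive style looks like at a call site:

# Sketch (Python 2): "playing compiler" at the call site.
def parse_count(value):
    return value * 2  # silently "works" for both ints and strings

result = parse_count("21")          # the caller meant to pass the int 21
if isinstance(result, int):         # the check a compiler would have done
    print "count:", result
elif hasattr(result, "isdigit"):    # duck-typed check: quacks like a string
    print "got a string instead:", result  # prints "2121" -- a silent bug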

One particular place I noticed that this can cause run-time production problems is in the rarely executed “except” clause of a try/except (Java’s try/catch). I ran into unpleasant surprises due to Python’s refusal to implicitly convert values to strings in a print statement. Where Java happily concatenates distinct types, Python requires you to convert everything with str() if you wish to use the “+” operator; with “,” you don’t need str(), but you lose control of the formatting. Whew – a bit of inconsistency, I’d say. And you’ll never know it’s a problem until the error path actually executes.
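A sketch of the trap (Python 2):

# The latent bug hides in the rarely executed except clause.
retries = 3
try:
    raise ValueError("queue unavailable")  # stand-in for a real failure
except ValueError, e:
    # The next line would raise TypeError ("cannot concatenate 'str' and
    # 'int' objects") -- discovered only when this branch finally runs:
    # print "giving up after " + retries + " retries: " + e
    print "giving up after " + str(retries) + " retries: " + str(e)
    print "giving up after", retries, "retries:", e  # no str(), but formatting suffers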

Another Python cultural issue that strikes me as “strange” is the lack of true multi-threading due to the GIL (Global Interpreter Lock). This limitation seems to be an arbitrary constraint imposed by the BDFL (Benevolent Dictator For Life). Sure, threading is a non-trivial issue – like any tool, it can be used or abused. But to summarily dismiss it and force people to spawn processes strikes me as arbitrary and ultimately retro.

Threading concerns can be divided into two basic types:

  • Threads that access shared resources, where access needs to be synchronized. Care, diligence and discipline need to be exercised.
  • Threads that access external resources and require no synchronization. Goetz et al., in their seminal book Java Concurrency in Practice, call these deferred computations. Since there is no synchronization, programmer complexity is greatly reduced.

It is the latter that is used more often, and it is thus the more important. Forcing users to always spawn processes is unnecessary accidental complexity. For an interesting recent perspective on the subject, see Michele Simionato’s Artima post, Threads, processes and concurrency in Python: some thoughts.
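A sketch of that second, synchronization-free case (Python 2):

# "Deferred computation" threads that share no mutable state -- each one
# blocks on network I/O, during which the GIL is released anyway.
import threading
import urllib2

def fetch(url):
    body = urllib2.urlopen(url).read()  # GIL released while blocked on I/O
    print url, "returned", len(body), "bytes"

urls = ["http://example.com/", "http://example.org/"]
threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()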


Twitter REST API

May 13, 2010

I recently finished a mini-project to calculate the similarity of two Twitter users. Being a REST fan and having implemented a substantial REST service, I’m always eager for an excuse to get my hands dirty with a bit o’ REST. I find it fascinating that no two REST APIs (web APIs) are the same, especially in terms of documentation.

As we all know, the term “REST” refers to a style and not a standard, so it is not surprising that actual implementations vary quite a bit. This lack of specificity is most pronounced on the client side. With SOAP, client access is almost always mediated by platform-specific client stubs automatically generated from the WSDL contract. With REST there is no such standard contract (despite WADL, which has not become a dominant force), and therefore no fully automated way to build client stubs. You can partially automate the process if you define your payload as an XML schema, but this still leaves out the other vital part: specifying the resource model. Section 1.3 of the JAX-RS specification explicitly states: “The specification will not define client-side APIs.” See my comments on the current non-standard situation of client stubs in some JAX-RS providers.

API Data Formats

In the absence of standard client stub generation mechanisms, documentation plays an increasingly important role. Fidelity to genuine REST precepts and the terminology used to describe resources and their HTTP methods become of prime importance to effective client usage.

How do we unambiguously describe the different resources and methods? The number and types of payload formats influence the decision. Do we support only one format – JSON or XML? If XML, do we have a schema? If so, which schema language – XSD or RELAX NG? Multiple XML formats such as Atom, RSS and/or proprietary XML? (By the way, the former two do not have a normative schema.) Do we support multiple formats? If so, do we use standard HTTP content negotiation?

Considering the strong presence of the Twitter REST API and my short albeit intense usage of it, I am a bit reluctant to “criticize”. So up front I issue a disclaimer that my knowledge is partial and subject to change. One very interesting fact I recently read in the book Building Social Web Applications is that over 80% of Twitter’s usage comes from its API and not from its web site! Caramba, that’s quite an ecosystem that has evolved around Twitter! All the more reason to invest in API contract specification and documentation.

General API Documentation

Professional, high-quality API documentation is obviously a vital need, especially as API usage increases. With an internet consumer-facing API, clients can access resources using any language of choice, so it is important to be as precise as possible. Having worked with many different APIs and services, I have come to appreciate the importance of good documentation. I regard documentation not as a separate add-on to the executable code, but rather as an integral part of the experience. It is a first-order concern.

The metaphor I would suggest is DDD – Documentation Driven Development. In fact, on my last big REST project, where I took on the responsibility of documenting the API, I soon found it more efficient to update the documentation as soon as any API change was made. This was especially true when data formats were modified! The documentation format was an Atlassian wiki, which unfortunately didn’t allow for global cross-page changes, so I had to keep code and its corresponding documentation closely synchronized by hand; otherwise the documentation would’ve quickly diverged and become unmanageable.

Deductive and Inductive Documentation

In general, documentation can be divided into deductive and inductive. Deductive documentation is based on the top-down approach. If you have an XML schema, all the better – you can use it as your basic axiom and derive all further documentation fragments in progressive refinement steps. Even in the absence of a schema, you can still leverage this principle.

Inductive documentation is solely based on examples – there is no general definition, and it is up to the client to infer commonalities. You practically have to do parallel diffs on many different XML examples to separate the common from the specific.

Theorem: all good documentation must have examples but it cannot rely only on examples! In other words, examples are necessary but not sufficient.

Google API as a Model

Google has done a great job of extracting a common subset of its many public APIs into the Google Data Protocol. All Google APIs share this definition: common data formats, common error mechanisms, header specifications, collection counts, etc. Google has standardized on AtomPub and JSON as its two primary data formats (with some RSS too). It does an excellent job of providing an unambiguous and clear specification of its entire protocol across all its API instances.

Take the YouTube API, for example. Although neither Google nor Atom uses an XSD schema, the precise details of the format are clearly described. Atom leverages the concept of extensions, whereby you can insert external namespaces (vocabularies) into the base Atom XML. Google’s Atom does not have to reinvent the wheel for cross-cutting extensions and can reuse common XML vocabularies in a standard way. See the Data API Protocol – XML element definitions page for details. Example namespaces are openSearch (OpenSearch schema) for collection counts and paging, and media for MRSS (yes, you can insert RSS into Atom – cool!).

Twitter Data Format Documentation

The Twitter General API documentation page and the FAQ do not do a good job of giving a high-level description of the Twitter data formats for requests and responses. There is only one brief mention of this, on the Things Every Developer Should Know page:

The API presently supports the following data formats: XML, JSON, and the RSS and Atom syndication formats, with some methods only accepting a subset of these formats.

No clear indication is given as to which kinds of resources accept which formats. Common sense would lead us to believe that JSON and the proprietary XML are isomorphic and supported for both requests and responses, and that, being feed formats, RSS and Atom are supported only for responses. It is unfortunate that this is not explicitly stated anywhere.

More disturbing is the lack of an XML schema or any attempt to formally define the XML vocabulary! It seems that only XML examples are provided for each resource. Googling for “Twitter API XSD” confirms my suspicion: it returns many mentions of “inferring XML schemas” from instances – a scary proposition indeed! What is the cardinality of the XML elements? Which ones are repeatable and which are not? What about the data types? The DRY (don’t repeat yourself) principle is also violated, since the same XML example is redundantly repeated on many pages. You can maybe get away with this for a small API, but for one as widely used as Twitter’s, I would have expected more investment in API contract specification.

Twitter Content Negotiation

Another concern is the way Twitter handles content negotiation. Instead of using the standard HTTP Accept header or a content query parameter, Twitter appends the format type to the URL (.json, .xml), which in effect creates a new resource. Google’s GData, by contrast, uses a query parameter such as alt=json or alt=rss to indicate the data format.
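The styles side by side (the Accept-header endpoint is a hypothetical illustration; the Twitter and GData URLs follow their documented patterns):

# Sketch (Python 2): three ways to select a representation.
import urllib2

# Twitter style: the format suffix is part of the URI -- in effect a new resource.
twitter_url = "http://api.twitter.com/1/statuses/user_timeline.json?screen_name=twitterapi"

# GData style: a query parameter selects the representation of one resource.
gdata_url = "http://gdata.youtube.com/feeds/api/videos?alt=json"

# Header style (hypothetical endpoint): one URI, format chosen via Accept.
req = urllib2.Request("http://api.example.com/statuses/user_timeline")
req.add_header("Accept", "application/json")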

Twitter API Versioning

This lack of explicit contract specification leads to problems with versioning. Versioning is one of those very difficult API problems that have no ideal, satisfactory answer – only partial solutions depending on the use case. Without a contract, it is difficult to even know what changes have been made.

Let’s say a change is made to an XML snippet that is shared across many resource representations. Twitter would have to make changes to each resource documentation page. Even worse, it then has to have some mechanism to inform clients of the contract change. Having a schema, or at least some common way of describing the format, would be a much better idea. The Twitter FAQ weakly puts the onus on clients to proactively monitor for such changes.

Google, by contrast, uses HTTP headers to indicate API versions, as well as standard namespace naming conventions.
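On the wire, that approach looks something like this (the header name is from the GData documentation; the endpoint and version value are illustrative):

# Sketch (Python 2): GData-style versioning via an HTTP header.
import urllib2

req = urllib2.Request("http://gdata.youtube.com/feeds/api/videos")
req.add_header("GData-Version", "2")  # version negotiated per request, not per URI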

API Client Packages

Typically, REST APIs leave it to third parties to provide language-specific client stubs. This is understandable because of the large number of languages out there – it would be prohibitive for a small(er) company to implement and test all these packages! However, the downside is that these packages are by definition non-standard (caveat emptor), and it is an open question how reliably they implement the current service definition.

Documentation varies wildly. Firstly, it is not always clear which package to use when several choices exist. Being mostly a Java guy, I focus here on Java clients. Twitter, for example, has four Java clients. You most often find minimal Javadoc with no further explanation, and API coverage is incomplete – features are missing. For example, for the user-timeline “method” (sidebar: a misuse of the REST term “method”!), Twitter4J supports the count query parameter whereas JTwitter apparently does not. The problem here is one of client fidelity to the API: when the Twitter API changes, what is the guarantee that the “mom and pop” client will sync up?

Speak of the devil: a rather interesting development happened just the other day with Amazon’s AWS Java client. On March 22, 2010, Amazon announced the rollout of a brand new AWS SDK for Java! Until then, they too had depended on third parties – the venerable JetS3t (which only supports S3 and CloudFront) and typica. Undoubtedly this was due to client pressure for precisely the reasons enumerated above! See Mr. JetS3t’s comment on the news.


One of the wonders of the human condition is how people manage to work around major obstacles when there is an overriding need. The plethora of Twitter clients in the real world is truly a testament to the ubiquity of the Twitter API.

However, there is still considerable room for improvement to remove accidental complexity and maximize client productivity. Twitter is still a young company, and there is an obvious maturation process ahead.

After all, Google has many more resources to fine-tune the API experience. But if I were the chief Twitter API architect, I would certainly take a long, hard look at the strategic direction of the API. Obviously there is major momentum in this direction, especially with the June deprecation of Basic Auth in favor of OAuth and the realignment of the REST and Search APIs. There is no reason to blindly mimic someone else’s API documentation style (think branding), but even less reason not to learn from others (prior knowledge) and minimize client cognitive overhead.