Archive for June, 2009

Adventures in the Cloud with AWS

June 27, 2009

Recently I’ve been engaged in an exciting AWS venture – mostly S3 but a bit of EC2. It started with a short contract, but then I subsequently rode the momentum wave and continued hacking away. The following observations are merely my first impressions – nothing more and nothing less.

Time-Sharing is Back!

Firstly, using AWS services seems to be a throwback to the old time-sharing paradigm where every compute cycle is directly billed to you. Ugh! To qualify that: it really only matters depending on who is paying and how high the charges are. If your boss is paying, it might not be a big deal for you, though it might be one for him. It also depends on which service you are using. S3 seems ridiculously cheap – I’ve racked up 16,000 calls that have cost me only $0.20 – yes, that’s 20 centavos! On the other hand, my smallest EC2 instance was running me over a dollar a day, so that was quickly adding up to be a budget buster for me. At that price, I’d rather go buy a book to read.
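For the curious, the arithmetic is easy to sketch in Python – the rates below are my own rough 2009-era assumptions (reverse-engineered from my bill), not official pricing:

```python
# Rough AWS cost arithmetic. Rates are assumptions, not official pricing.
S3_PRICE_PER_1K_REQUESTS = 0.0125  # assumed: ~$0.0125 per 1,000 S3 requests
EC2_SMALL_PER_HOUR = 0.10          # assumed: ~$0.10/hour for a small instance

def s3_request_cost(num_requests):
    """Cost of S3 API calls at the assumed per-1,000-request rate."""
    return num_requests / 1000.0 * S3_PRICE_PER_1K_REQUESTS

def ec2_daily_cost(hours_per_day=24):
    """Cost of running one small EC2 instance for part or all of a day."""
    return hours_per_day * EC2_SMALL_PER_HOUR

print("16,000 S3 calls: $%.2f" % s3_request_cost(16000))
print("EC2 small, 24h:  $%.2f" % ec2_daily_cost())
```

The asymmetry is striking: tens of thousands of S3 calls cost pennies, while one always-on EC2 instance dwarfs that every single day.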

Testing Cost

This whole pay-per-cycle model also has interesting ramifications for best practices such as continuous integration test cycles. Assuming you’re constantly running an automated build and your rather extensive integration tests are hitting your AWS infrastructure, you could potentially run up quite a bill. It’s a constant, recurring cost. Imagine a web site with millions of images that need to be served – what would your tests look like? If you own your own dedicated infrastructure (servers, disks), running tests has no apparent extra cost. With the AWS cloud model, you get billed for anything you do against AWS. Could be OK if you do your math, but it is certainly a brave new world.
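A quick back-of-the-envelope sketch of what “doing your math” might look like – every figure below is a made-up assumption for illustration, not a real measurement:

```python
# Estimated monthly cost of a CI pipeline whose integration tests hit AWS.
# All figures are hypothetical assumptions for illustration.
BUILDS_PER_DAY = 20        # assumed CI build frequency
AWS_CALLS_PER_BUILD = 5000 # assumed integration-test traffic per build
PRICE_PER_CALL = 0.0000125 # assumed: ~$0.0125 per 1,000 requests
DAYS_PER_MONTH = 30

monthly_bill = BUILDS_PER_DAY * AWS_CALLS_PER_BUILD * PRICE_PER_CALL * DAYS_PER_MONTH
print("Estimated monthly CI bill: $%.2f" % monthly_bill)
```

Even at cheap per-request rates, a busy build server multiplies the pennies into real money – and that’s before you add any EC2 instances spun up just for testing.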

Language Client Toolkits

One of the things I’ve noticed is the disparate language-specific toolkits available for AWS. Although I’m mainly a Java guy, I’ve also delved into the Perl (Eric Wagner’s S3:: and Léon Brocard’s Net::Amazon::S3) and Python (boto) libraries. It’s most interesting how the canonical REST/XML model is translated, lossily, into each different client binding. This can be disconcerting and frustrating if you need to work with multiple toolkits and need access to the full functionality of the underlying AWS service.
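Under the hood, every one of these toolkits boils down to the same canonical REST requests. Here’s a simplified Python sketch of S3’s REST authentication scheme as I understand it (HMAC-SHA1 over a “string to sign”) – it omits the x-amz- header handling, and the keys are obviously placeholders:

```python
import base64
import hmac
from hashlib import sha1

def sign_s3_request(secret_key, verb, resource, date,
                    content_md5="", content_type=""):
    """Build S3's REST string-to-sign and its HMAC-SHA1 signature.

    Simplified sketch: real requests may also fold in canonicalized
    x-amz- headers, which are omitted here.
    """
    string_to_sign = "\n".join([verb, content_md5, content_type, date, resource])
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"), sha1).digest()
    return base64.b64encode(digest).decode("ascii")

# Placeholder credentials and a sample GET of an object.
sig = sign_s3_request("my-secret", "GET", "/mybucket/mykey1",
                      "Sat, 27 Jun 2009 12:00:00 GMT")
print("Authorization: AWS MY_ACCESS_KEY:" + sig)
```

Every client binding is ultimately wrapping this same signing-and-HTTP dance – which is exactly why it’s so jarring when each one exposes a different slice of the underlying functionality.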

Java Toolkit

The “standard” Java client library is James Murty’s JetS3t, which covers only S3 and, most recently, CloudFront – AWS’s new CDN offering. My question: considering James wrote the O’Reilly book Programming Amazon Web Services: S3, EC2, SQS, FPS, and SimpleDB covering all AWS services, why does his library support only S3? Hmm… That means I’ve got to pull in another client library, which detracts from an integrated solution. If I need to access SQS or SimpleDB, I have to use a completely different library, namely typica. One unified, consistent approach would have been better. No comprendo, no capisco.

AWS Testing Toolkit?

All this leads me to wonder why Amazon hasn’t released the in-house client libraries they use to test their AWS services. They do test them, don’t they? The tests obviously have to be written in some language, and they have to be quite extensive. I’m puzzled why this hasn’t been made available to the community. They could easily release it “as is,” with no guarantees, if they want to avoid the cost of officially supporting disparate client bindings. This would certainly be in the “open source” spirit.

Hierarchical Buckets

S3 seems rather primitive to me insofar as it doesn’t support hierarchical buckets. This leads everyone to implement “hacky” virtual directories by using keys such as “mydir/mykey1” to emulate directories. Is it really that hard to implement nested buckets? Since there is obviously a huge need/demand for this, does AWS have it on the roadmap?
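To show what the “hacky” emulation looks like in practice, here’s a Python sketch – the function and key layout are my own invention – mimicking the prefix/delimiter-style listing that clients use to fake directories over S3’s flat keyspace:

```python
def list_virtual_dir(keys, prefix="", delimiter="/"):
    """Emulate directory listing over a flat S3-style keyspace.

    Returns (files, subdirs): keys directly under `prefix`, plus the
    emulated sub-directories implied by the delimiter.
    """
    files, dirs = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Deeper key: collapse it into its first-level "directory".
            dirs.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            files.append(key)
    return files, sorted(dirs)

keys = ["mydir/mykey1", "mydir/sub/mykey2", "other/mykey3", "toplevel"]
print(list_virtual_dir(keys, prefix="mydir/"))
```

Every toolkit and S3 browser ends up re-implementing some variant of this, which is exactly why native nested buckets feel overdue.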

REST-ian Batching

Another issue relates to batching, which leads to the more general difficulty of batching in a REST-ian approach. Say I want to delete 100 objects in a bucket. Wouldn’t it be so much more efficient to issue one call instead of 100? This would certainly make AWS more scalable and performant, n’est-ce pas?
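To make the arithmetic concrete, here’s a Python sketch contrasting one-call-per-object deletes with a purely hypothetical bulk-delete operation – no such S3 call exists today, and the 25-keys-per-request limit is invented for illustration:

```python
def chunked(items, size):
    """Split a list of keys into fixed-size batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

keys = ["key-%03d" % i for i in range(100)]

# Today: deleting 100 objects means 100 individual DELETE requests.
single_calls = len(keys)

# Hypothetical bulk delete accepting up to 25 keys per request.
batched_calls = len(chunked(keys, 25))

print("single-object deletes: %d, batched deletes: %d"
      % (single_calls, batched_calls))
```

Even with a modest batch size, you cut the round trips by an order of magnitude – and each round trip is a signed HTTPS request you’re also billed for.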

Conclusion

All the above comments are preliminary and subject to future revisions. I certainly do find the AWS model fascinating, especially their new Public Data Sets, in particular genome databases such as Ensembl Annotated Human Genome Data. There is a real synergy going on here, keep on trucking guys!