Amazon was in the news again of late, for the wrong reasons. In April this year, Amazon Web Services (AWS) customers hosted out of the Northern Virginia data center started experiencing major connectivity and latency issues. This escalated into a full-blown outage lasting 24 hours that took down hundreds of AWS clients' services, including those of Foursquare, Reddit and Wildfire. The event received mainstream media coverage on CNN, The New York Times and The Washington Post. Not only was it an embarrassment for Amazon, but it also raised nagging doubts in the minds of fence-sitters on the cloud question. Wasn't reliability through redundancy a key promise of the cloud? What happened to all that?
What got merely a passing mention in that story was the fact that Amazon's key customer Netflix was not affected by the outage at all. So the customer with the biggest exposure does not feel a thing? How does that happen? Did they get special treatment, or did they somehow get lucky?
Turns out it was neither. In fact, there is a whole sub-story in there that got discussed primarily in technical circles, and it tells us a thing or two about the best practices of a cloud sourcing strategy.
Before we delve into that, for the record, here is Amazon's "official" report on the event. Unsurprisingly, it is laced with so much marketing sugar-coating that it is difficult to understand what the core issue was. There are also several technical post-mortems, like the one the CNN article refers to. But from a sourcing strategy standpoint, Netflix's Techblog post on the event hits the sweet spot. Here are some takeaways from it:
1. Hope for the best (but design for the worst) - building Russian dolls of redundancy: Netflix did not simply lift and shift its pre-AWS data center onto Amazon's infrastructure but actually re-designed it from the ground up for the cloud. In doing so, outages of this nature were anticipated and critical design controls were built in to prepare for them, and those controls saved the day. Some of these incremental lessons were listed in one of their earlier posts from 2010. One such principle is aptly called the "Rambo architecture", which Netflix describes as follows: "Each system has to be able to succeed, no matter what, even all on its own". They spread their systems evenly across all of Amazon's availability zones (AZs), all zones are active at all times (as opposed to keeping idle zones in a master/slave configuration), and each zone has N+1 redundancy. They also always chose reserved instances for the whole contract period (as opposed to on-demand ones) when provisioning new capacity, thus guaranteeing availability for their exclusive use. In effect, they created redundancies within the default redundancies of the cloud until they were satisfied (a rough placement sketch follows this list). Of course, it was more expensive than it could have been, but in their own words it was "money well spent since it makes our systems more resilient to failures".
2. The art of culling - it doesn't have to be all or nothing: Netflix consciously picked specific AWS services and used its own substitutes for the ones it was not satisfied with. For instance, they chose to avoid dependencies on Amazon's EBS (Elastic Block Store) because they found its performance to be an issue. This also turned out to be the biggest factor working in their favor, because the first point of failure at Amazon was the EBS system choking up. Netflix was still indirectly (though marginally) affected, since Amazon's Elastic Load Balancers (ELB) are EBS-backed and Netflix used them for its first tier of load balancing. There is of course a possibility that this was purely coincidental; the failure could have hinged upon any of the other services they use (such as ELB, S3 or Cassandra). But that is not the point. What matters is that they actually benchmarked Amazon services against alternatives (in the spirit of the benchmark sketch after this list) to determine which elements should be outsourced and which should not.
3. Monkey business - preempt failure scenarios: Netflix's Chaos Monkey is probably a glorified "kill -9 <random process id>" command (roughly an "End Task" on Windows), albeit at cloud scale (see the termination sketch after this list). But the concept behind it is as imaginative as the name itself: the only way to be sure that failure contingencies actually work is to simulate failure constantly and randomly. They had not prepared for the rare event of an entire AZ failure - which they now want to simulate with the Chaos Gorilla. In fact, I would recommend they let loose a Chaos Yeti to test out whole-provider failures (yes, they would need to have multiple providers first).
4. When the going gets tough, (...): Once Netflix figured out that Amazon was not going to recover before their peak traffic time of day, they started to manually reconfigure the ELB load balancers to avoid nodes within the affected AZ. This could have been automated to kick in seamlessly once a threshold level of degradation was detected inside an AZ, but ELB policies were not designed to handle this. So they did the next best thing - got together with the service engineers, rolled up their sleeves, and started a manual reconfiguration of the ELB endpoints to boycott the failing AZ (a sketch of what that could look like follows below). They eventually want to make zone failover and recovery a one-click automated affair.
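To make point 1 concrete, here is a minimal sketch, assuming today's boto3 SDK rather than anything Netflix actually ran in 2011, of spreading an N+1 fleet evenly across every availability zone in a region. The AMI, instance type and capacity figures are placeholders.

```python
import math
import boto3

AMI_ID = "ami-xxxxxxxx"      # placeholder image
INSTANCE_TYPE = "m1.large"   # placeholder instance type
BASE_CAPACITY = 9            # instances needed to carry peak load

ec2 = boto3.client("ec2", region_name="us-east-1")

zones = [z["ZoneName"]
         for z in ec2.describe_availability_zones()["AvailabilityZones"]
         if z["State"] == "available"]

# N+1 across zones: provision enough capacity that losing any single zone
# still leaves BASE_CAPACITY instances running, spread evenly over all zones.
per_zone = math.ceil(BASE_CAPACITY / max(len(zones) - 1, 1))

for zone in zones:
    ec2.run_instances(
        ImageId=AMI_ID,
        InstanceType=INSTANCE_TYPE,
        MinCount=per_zone,
        MaxCount=per_zone,
        Placement={"AvailabilityZone": zone},  # pin this slice of capacity to the zone
    )
# Reserved (rather than on-demand) capacity is a separate billing commitment,
# purchased per zone, which is what guarantees these instances exist when needed.
```

Every zone stays active, so there is no idle master/slave split; the extra zone's worth of capacity is the "money well spent".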
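For point 2, a hypothetical micro-benchmark of the kind such a culling decision might rest on: timing sequential writes against two mount points, say an EBS volume versus instance-local ephemeral storage. The mount paths are assumptions about how the host is set up, and a real evaluation would measure far more than this.

```python
import os
import time

CANDIDATES = {"ebs": "/mnt/ebs-volume", "ephemeral": "/mnt/ephemeral0"}  # assumed mounts
BLOCK = b"\0" * (4 * 1024 * 1024)   # 4 MiB per write
TOTAL_MB = 512

for name, mount in CANDIDATES.items():
    path = os.path.join(mount, "bench.tmp")
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(TOTAL_MB // 4):
            f.write(BLOCK)
            f.flush()
            os.fsync(f.fileno())    # force each write through to the device
    elapsed = time.time() - start
    os.remove(path)
    print(f"{name}: {TOTAL_MB / elapsed:.1f} MB/s sequential write")
```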
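For point 3, a toy Chaos Monkey in the same spirit (Netflix's actual tool is considerably smarter): pick one running instance at random and terminate it, so the surviving fleet has to prove it can carry the load. Again this assumes the modern boto3 SDK.

```python
import random
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Collect every running instance in the region.
reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]
instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instances:
    victim = random.choice(instances)
    print(f"Chaos Monkey terminating {victim}")
    ec2.terminate_instances(InstanceIds=[victim])
```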
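And for point 4, a sketch of the "one-click" zone evacuation Netflix had to do by hand: tell every classic ELB in the account to stop routing into a degraded availability zone. The zone name is a placeholder, and the boto3 "elb" client is a present-day stand-in for the 2011 API.

```python
import boto3

BAD_ZONE = "us-east-1a"   # placeholder for the degraded zone

elb = boto3.client("elb", region_name="us-east-1")  # classic load balancers

for lb in elb.describe_load_balancers()["LoadBalancerDescriptions"]:
    name = lb["LoadBalancerName"]
    if BAD_ZONE in lb["AvailabilityZones"]:
        print(f"Removing {BAD_ZONE} from {name}")
        elb.disable_availability_zones_for_load_balancer(
            LoadBalancerName=name,
            AvailabilityZones=[BAD_ZONE],
        )
```

The reverse call, enable_availability_zones_for_load_balancer, would handle recovery once the zone is healthy again.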
So those were four different things. What is Netflix's one little secret? I think the single core idea that differentiates Netflix is that it did its own due diligence to develop its sourcing strategy, stayed fully committed to it, and remained relentlessly engaged in improving it thereafter. Some of this can be gleaned from their earlier blog post on why Netflix chose Amazon. It is very clear that Netflix has not merely hoped that the cloud will work as per specifications because of contractual controls; it has accepted the nascent nature of the business and taken the bull by the horns. The company has identified and bolstered both its strengths (strategic service design) and its inherent weaknesses (predicting demand). It has decided to outsource "the undifferentiated heavy lifting" while retaining a very capable in-house IT architecture team that has helped set it apart from the crowd. This is a great example of selective outsourcing of what is traditionally thought of as an all-or-nothing service. The case above demonstrates that it can be done, and that when done well it brings great benefits.
Now, can (and should) every company outsource on the cloud with such deep foresight and ingenuity? I don't believe there is a single answer. It's a question of the risk-reward trade-offs each company chooses to make given its specific situation, and it calls for a highly customized self-assessment.