The Twitter Fail Whale and Global Optimization
I was looking for a good example to explain global vs. local optimization, and lo, one fell right out of the twittersphere at me. It came from the Twitter engineering team themselves. Ed Ceaser (@asdf) and Nick Kallen (@nk) recently posted a blog entry entitled “The Anatomy of the Whale”. The entry discusses the team’s efforts to track down a capacity problem that caused too many people to see the ‘fail whale’, Twitter’s visual representation of an HTTP 503 Service Unavailable error. As the guys say:
“Discovering the root cause can be very difficult because Whales are an indirect symptom of a root cause that can be one of many components. In other words, the only concrete fact that we knew at the time was that there was some problem, somewhere.”
It struck me then that finding a performance bottleneck is a fantastic example of a global process optimization problem. Any seasoned developer knows that finding performance issues is real detective work. In my years as a developer, I learned to recognize performance blind alleys and red herrings (to name but two clichés), such as investing time in optimizing one part of an end-to-end process when that part is not what limits the whole. Ed and Nick offer the following great advice for any optimization effort: “Focus on the biggest contributor to the problem”. Tracking down that biggest contributor is where the need for visibility comes in.
Gaining visibility into any process has a number of aspects:
Measuring local things credibly
Aggregating the local measurements into an end-to-end picture
Presenting the metrics visually to gain insight into their relationships
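To make those three aspects concrete, here is a minimal sketch in Python. It is not Twitter's tooling: the component names and timings are invented for illustration, and the "visualization" is just a text bar chart. It sums local per-request timings into an end-to-end picture and then picks out the biggest contributor.

```python
from collections import defaultdict

# Hypothetical local measurements: milliseconds spent per component for
# each request. Names and numbers are invented for illustration only.
requests = [
    {"routing": 2.0, "cache": 40.0, "database": 15.0, "render": 8.0},
    {"routing": 3.0, "cache": 55.0, "database": 10.0, "render": 7.0},
    {"routing": 2.0, "cache": 45.0, "database": 20.0, "render": 9.0},
]


def aggregate(samples):
    """Sum the local timings per component across all requests."""
    totals = defaultdict(float)
    for sample in samples:
        for component, ms in sample.items():
            totals[component] += ms
    return dict(totals)


def shares(totals):
    """Express each component's total as a fraction of end-to-end time."""
    grand_total = sum(totals.values())
    return {component: t / grand_total for component, t in totals.items()}


def biggest_contributor(totals):
    """Return the component with the largest share of total time."""
    return max(totals, key=totals.get)


totals = aggregate(requests)

# Crude text "visualization": one bar per component, largest first.
ranked = sorted(shares(totals).items(), key=lambda kv: kv[1], reverse=True)
for component, share in ranked:
    print(f"{component:<10} {share:6.1%} {'#' * round(share * 40)}")

print("biggest contributor:", biggest_contributor(totals))
```

With this made-up data, shaving time off routing or rendering would be a purely local optimization: the end-to-end number barely moves until the largest bar is addressed.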
The Twitter team describes this measurement data problem as:
“Debugging performance issues is really hard. But it’s not hard due to a lack of data; in fact, the difficulty arises because there is too much data. We measure dozens of metrics per individual site request, which, when multiplied by the overall site traffic, is a massive amount of information about Twitter’s performance at any given moment. Investigating performance problems in this world is more of an art than a science. It’s easy to confuse causes with symptoms and even the data recording software itself is untrustworthy.”
Visualizing the performance data, the Twitter team discovered that their problem was the decay in throughput of data being delivered during peak loads from their distributed caching subsystem, based on Memcached. Armed with this information, they tackled the problem in two ways: reduce the volume of calls to Memcached (they found 7 out of 17 calls to Memcached were unnecessary), and beef up the Memcached cluster.
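The post does not say why those calls were unnecessary, so the following is only a sketch of one common way to cut cache traffic: remember values already fetched during a single request so that repeated lookups never leave the process. The `CountingBackend` and `RequestScopedCache` classes are hypothetical stand-ins written for this example; they are not Twitter's code and not a real Memcached client.

```python
class CountingBackend:
    """Hypothetical stand-in for a remote cache; counts every round trip."""

    def __init__(self, data):
        self._data = dict(data)
        self.calls = 0

    def get(self, key):
        self.calls += 1
        return self._data.get(key)


class RequestScopedCache:
    """Remembers lookups for the lifetime of one request.

    Repeated gets for the same key are answered locally instead of
    making another round trip to the backend.
    """

    def __init__(self, backend):
        self._backend = backend
        self._seen = {}

    def get(self, key):
        if key not in self._seen:
            self._seen[key] = self._backend.get(key)
        return self._seen[key]


# Invented keys and values, purely for illustration.
backend = CountingBackend({"user:1": "alice", "timeline:1": [10, 11]})
cache = RequestScopedCache(backend)

# One request that asks for the same keys several times.
keys_requested = ["user:1", "timeline:1", "user:1", "user:1", "timeline:1"]
values = [cache.get(key) for key in keys_requested]

print(f"{len(keys_requested)} lookups, {backend.calls} backend calls")
```

Wrapping the backend in a counter is also a cheap way to measure locally: it shows how many calls a request really makes before and after a change.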
There is a further point to make about the Twitter blog post: it is another example of the Twitter team doing their engineering in public. This transparency gives them credibility (Amazon and Salesforce.com, are you listening?) and positively affects Twitter’s relationship with their customers, potential paying customers, and investors. I have blogged about the work of the Twitter engineering team before. Their commitment to transparency continues to impress me. I’d rather hear my service providers say “look, we have a problem, here’s how we measured it, and here are the steps we are taking to resolve it” than have them operate on a “Wizard-of-Oz-behind-the-magic-curtain” basis.