The Twitter Fail Whale and Global Optimization

Blogger: Richard Watson

I was looking for a good example to explain global vs. local optimization, and lo, one fell right out of the twittersphere at me. It came from the Twitter engineering team themselves. Ed Ceaser (@asdf) and Nick Kallen (@nk) recently posted a blog entry entitled “The Anatomy of the Whale”. The entry discusses the team’s efforts to track down a capacity problem that caused too many people to see the ‘fail whale’, Twitter’s visual representation of an HTTP 503 Service Unavailable error. As the guys say:

“Discovering the root cause can be very difficult because Whales are an indirect symptom of a root cause that can be one of many components. In other words, the only concrete fact that we knew at the time was that there was some problem, somewhere.”

It struck me then that finding a performance bottleneck is a fantastic example of a global process optimization problem. Any seasoned developer knows that finding performance issues is real detective work. In my years as a developer, I learned to recognize performance blind alleys and red herrings (to name but two clichés), such as investing time in optimizing one part of an end-to-end process that turns out not to be the bottleneck. Ed and Nick offer the following great advice for any optimization effort: “Focus on the biggest contributor to the problem”. Tracking down that biggest contributor is where the need for visibility comes in.
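To make the local vs. global distinction concrete, here is a minimal Python sketch. It assumes we already have an average time per component for one request; the component names and numbers are invented for illustration and are not Twitter's data. It ranks components by their share of the total, then estimates the end-to-end gain from speeding up a single component.

```python
def rank_contributors(timings_ms):
    """Return (component, milliseconds, share_of_total), largest share first."""
    total = sum(timings_ms.values())
    ranked = sorted(timings_ms.items(), key=lambda item: item[1], reverse=True)
    return [(name, ms, ms / total) for name, ms in ranked]


def end_to_end_speedup(timings_ms, component, factor):
    """Overall speedup if `component` alone becomes `factor` times faster."""
    total = sum(timings_ms.values())
    improved = total - timings_ms[component] + timings_ms[component] / factor
    return total / improved


# Hypothetical per-request timings in milliseconds (illustrative only).
sample_timings = {"cache": 300.0, "database": 60.0, "render": 30.0, "network": 10.0}

if __name__ == "__main__":
    for name, ms, share in rank_contributors(sample_timings):
        print(f"{name:10s} {ms:6.1f} ms  {share:5.1%}")
    # Local optimization: make a small contributor twice as fast.
    print(f"render x2: {end_to_end_speedup(sample_timings, 'render', 2.0):.2f}x overall")
    # Global optimization: make the biggest contributor twice as fast.
    print(f"cache  x2: {end_to_end_speedup(sample_timings, 'cache', 2.0):.2f}x overall")
```

With these made-up numbers, doubling the speed of rendering improves the end-to-end time by only about 4%, while doubling the speed of the biggest contributor makes the whole request 1.6 times faster. That is the arithmetic behind "focus on the biggest contributor".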

Show me that call stack again, Watson

Visibility

Gaining visibility into any process has a number of aspects:

The Twitter team describes this measurement data problem as:

“Debugging performance issues is really hard. But it’s not hard due to a lack of data; in fact, the difficulty arises because there is too much data. We measure dozens of metrics per individual site request, which, when multiplied by the overall site traffic, is a massive amount of information about Twitter’s performance at any given moment. Investigating performance problems in this world is more of an art than a science. It’s easy to confuse causes with symptoms and even the data recording software itself is untrustworthy.”

By visualizing the performance data, the Twitter team discovered that their problem was a decay in the throughput of data delivered during peak loads by their distributed caching subsystem, which is based on Memcached. Armed with this information, they tackled the problem in two ways: reducing the volume of calls to Memcached (they found that 7 out of 17 calls to Memcached were unnecessary), and beefing up the Memcached cluster.
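The summary above does not say how the unnecessary calls were removed. One common way to cut repeat lookups, shown here purely as an assumption, is to remember values already fetched within a single request. In this sketch `CountingCache` is a hypothetical stand-in for a remote cache client, not the real Memcached API, and the keys are invented.

```python
class CountingCache:
    """Hypothetical stand-in for a remote cache client; counts every round trip."""

    def __init__(self, data):
        self._data = dict(data)
        self.calls = 0

    def get(self, key):
        self.calls += 1
        return self._data.get(key)


class RequestScopedCache:
    """Wraps a backend and remembers values already fetched during one request."""

    def __init__(self, backend):
        self._backend = backend
        self._seen = {}

    def get(self, key):
        # Only go to the backend the first time a key is requested.
        if key not in self._seen:
            self._seen[key] = self._backend.get(key)
        return self._seen[key]


def handle_request(cache, keys):
    """Simulate a request that looks up several keys, some more than once."""
    return [cache.get(key) for key in keys]


backend = CountingCache({"user:1": "alice", "timeline:1": [10, 11]})
keys = ["user:1", "timeline:1", "user:1", "user:1", "timeline:1"]

# Without memoization, every lookup is a round trip to the backend.
handle_request(backend, keys)
naive_calls = backend.calls

# With a fresh request-scoped wrapper, repeated keys are served locally.
backend.calls = 0
handle_request(RequestScopedCache(backend), keys)
memoized_calls = backend.calls

print(naive_calls, memoized_calls)
```

In this toy request, five lookups collapse to two backend calls. A wrapper like this would have to be created per request and discarded afterwards, otherwise it risks serving stale values.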

Transparency

There is a further point to make about the Twitter blog post: it is another example of the Twitter team doing their engineering in public. This transparency gives them credibility (Amazon and Salesforce.com, are you listening?) and positively affects Twitter’s relationship with their customers, potential paying customers, and investors. I have blogged about the work of the Twitter engineering team before. Their commitment to transparency continues to impress me. I’d rather hear my service providers say “look, we have a problem, here’s how we measured it, and here are the steps we are taking to resolve it” than have them operate on a “Wizard-of-Oz-behind-the-magic-curtain” basis.
