Twitter On Outage: Not How We Wanted Today To Go
On Thursday, Twitter suffered extended downtime for all web users, while mobile users ran into problems of their own.
Twitter initially updated its status blog to say, “Users may be experiencing issues accessing Twitter. Our engineers are currently working to resolve the issue.”
The company later updated that to say, “Update: The issue has stabilized and all services are restored.”
Unfortunately, that was not the end of it, as outages continued.
Everything now seems to be back in order (fingers crossed), and Twitter’s Mazen Rawashdeh explained what happened on the official Twitter blog:
Not how we wanted today to go. At approximately 9:00am PDT, we discovered that Twitter was inaccessible for all web users, and mobile clients were not showing new Tweets. We immediately began to investigate the issue and found that there was a cascading bug in one of our infrastructure components. This wasn’t due to a hack or our new office or Euro 2012 or GIF avatars, as some have speculated today. A “cascading bug” is a bug with an effect that isn’t confined to a particular software element, but rather its effect “cascades” into other elements as well. One of the characteristics of such a bug is that it can have a significant impact on all users, worldwide, which was the case today. As soon as we discovered it, we took corrective actions, which included rolling back to a previous stable version of Twitter.
We began recovery at around 10:10am PDT, dropped again around 10:40am PDT, and then began full recovery at 11:08am PDT. We are currently conducting a comprehensive review to ensure that we can avoid this chain of events in the future.
For the past six months, we’ve enjoyed our highest marks for site reliability and stability ever: at least 99.96% and often 99.99%. In simpler terms, this means that in an average 24-hour period, twitter.com has been stable and available to everyone for roughly 23 hours, 59 minutes and 40-ish seconds. Not today though.
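For readers who want to check the arithmetic behind those reliability figures, here is a rough back-of-the-envelope sketch. It assumes a plain 86,400-second day and treats the quoted percentages as exact; the function names are made up for illustration and have nothing to do with how Twitter measures availability.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a 24-hour period


def daily_uptime_seconds(availability_percent: float) -> float:
    """Seconds of uptime in one day at the given availability percentage."""
    return SECONDS_PER_DAY * availability_percent / 100.0


def format_hms(seconds: float) -> str:
    """Format a duration as hours, minutes and whole seconds (truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


# The two availability figures quoted in the blog post.
for pct in (99.96, 99.99):
    uptime = daily_uptime_seconds(pct)
    downtime = round(SECONDS_PER_DAY - uptime, 2)
    print(f"{pct}% -> up {format_hms(uptime)}, down about {downtime}s per day")
```

Under those assumptions, 99.96% works out to about 23 hours, 59 minutes and 25 seconds of uptime per day, and 99.99% to about 23 hours, 59 minutes and 51 seconds, so the "40-ish seconds" in the post sits between the two.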
It’s true that users now see the Fail Whale, and run into other problems that might surface that image, far less often than they used to.