Facebook Explains Outage
Facebook was down for over 2.5 hours for some users, according to the company. A post in Facebook’s engineering notes explains:
The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.
The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid.
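To make that failure mode concrete, here is a minimal sketch of a "verify and repair" config lookup. It is not Facebook's actual code: the `ConfigClient` class, the `is_valid` rule, and the `timeout` key are all hypothetical names chosen for illustration, and the cache and persistent store are modeled as plain dictionaries.

```python
def is_valid(value):
    # Hypothetical validity rule for illustration: a positive integer.
    return isinstance(value, int) and value > 0


class ConfigClient:
    """Reads config from a cache and repairs invalid entries from a store."""

    def __init__(self, cache, store):
        self.cache = cache        # fast, possibly stale or corrupted copy
        self.store = store        # persistent source of truth
        self.store_reads = 0      # counts how often the store is queried

    def get(self, key):
        value = self.cache.get(key)
        if is_valid(value):
            return value
        # Cached value looks invalid: fetch a replacement from the store.
        self.store_reads += 1
        fresh = self.store.get(key)
        if is_valid(fresh):
            self.cache[key] = fresh   # repair succeeds, later reads hit the cache
            return fresh
        # The store's value is invalid too, so nothing can be repaired
        # and every later lookup will query the store again.
        return None


# Case 1: transient cache problem, store is fine.
transient = ConfigClient(cache={"timeout": -1}, store={"timeout": 30})
transient_results = [transient.get("timeout") for _ in range(3)]

# Case 2: the persistent store itself holds the invalid value.
broken = ConfigClient(cache={"timeout": -1}, store={"timeout": -1})
broken_results = [broken.get("timeout") for _ in range(3)]

print(transient_results, transient.store_reads)
print(broken_results, broken.store_reads)
```

In the first case the store is queried once and the cache is fixed. In the second, every lookup falls through to the store and never succeeds, so in this sketch the store traffic grows with the number of requests instead of with the number of bad cache entries.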
More of the technical details are in Facebook’s post. Facebook has turned off the system that attempts to correct configuration values, and is exploring new designs for it.
I attended a screening of The Social Network last night, in which Mark Zuckerberg’s character stresses how much downtime would hurt the site’s reputation while he is getting it launched. I thought that was kind of funny, considering the timing.