Why You May Have Gone a Couple Hours Without Gmail
If you use Gmail, know someone who uses Gmail, use Facebook or Twitter, or read blogs or online news, you probably know that Gmail had some problems yesterday. The service experienced a widespread outage, which lasted about 100 minutes according to Google.
It’s amazing how much frustration can be vented in such a short period of time, but bloggers and Twitterers alike voiced their outrage and concern for the world to see. This is to be expected, though. While Gmail is usually reliable, outages seem to have become more common in recent months. That’s not a great sign when the company once guaranteed 99.9% uptime for the service and Google Apps in general.
"We know how many people rely on Gmail for personal and professional communications, and we take it very seriously when there’s a problem with the service," said Ben Treynor, Gmail’s VP Engineering and Site Reliability Czar on the Gmail blog. "Thus, right up front, I’d like to apologize to all of you — today’s outage was a Big Deal, and we’re treating it as such. We’ve already thoroughly investigated what happened, and we’re currently compiling a list of things we intend to fix or improve as a result of the investigation."
So what happened to Gmail yesterday? Google took a "small fraction" of Gmail’s servers offline to perform routine upgrades, as usual. Treynor explains:
However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system "stop sending us traffic, we’re too slow!". This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn’t access Gmail via the web interface because their requests couldn’t be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don’t use the same routers.
He says the Gmail engineering team was alerted to the failures within seconds and addressed the problem. Read Treynor’s post for the technical details.
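The cascade Treynor describes can be illustrated with a toy model. This is only a sketch under stated assumptions, not a picture of how Google’s request routers actually work: the capacities, the load figure, and the rule that traffic is split evenly across the routers still in service are all hypothetical numbers and simplifications chosen for illustration.

```python
def simulate_cascade(capacities, total_load):
    """Return the number of healthy routers after each round of overload.

    capacities: per-router capacity in hypothetical request units.
    total_load: total incoming requests, assumed to be split evenly
                across whichever routers are still healthy.
    """
    healthy = sorted(capacities)
    history = [len(healthy)]
    while healthy:
        share = total_load / len(healthy)
        # A router whose share exceeds its capacity stops accepting traffic.
        survivors = [c for c in healthy if c >= share]
        if len(survivors) == len(healthy):
            break  # every remaining router can carry its share
        healthy = survivors
        history.append(len(healthy))
    return history


# Hypothetical fleet of six routers handling 540 units of traffic.
full_fleet = [100, 110, 125, 150, 200, 300]
print(simulate_cascade(full_fleet, 540))       # [6]

# Same traffic with one router taken offline for maintenance.
print(simulate_cascade(full_fleet[:-1], 540))  # [5, 4, 2, 0]
```

With all six routers in service, each carries 90 units and nothing fails. With one removed, the weakest router tips over first, its traffic lands on the rest, and each round pushes more routers past their limit until none are left. That mirrors the "a few, then a few more, then nearly all" pattern in Treynor’s account.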
Google says they have turned their full attention to preventing this kind of thing in the future, and will be working hard over the next few weeks to implement reliability improvements. They then threw that 99.9% number around again, saying, "Gmail remains more than 99.9% available to all users."
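To put the 99.9% figure next to the roughly 100-minute outage, the downtime a 99.9% target allows can be worked out directly. The sketch below assumes a 30-day month and a 365-day year as measurement windows; Google’s statement does not say which window it uses, so both are shown.

```python
def downtime_budget_minutes(availability, days):
    """Minutes of downtime allowed over `days` at the given availability."""
    return (1 - availability) * days * 24 * 60


OUTAGE_MINUTES = 100  # approximate outage length reported by Google

monthly = downtime_budget_minutes(0.999, 30)
yearly = downtime_budget_minutes(0.999, 365)

print(round(monthly, 1))  # 43.2
print(round(yearly, 1))   # 525.6
print(OUTAGE_MINUTES > monthly)  # True
print(OUTAGE_MINUTES > yearly)   # False
```

Under these assumptions, a 100-minute outage on its own exceeds a month’s allowance at 99.9% but fits within a year’s, so whether the "more than 99.9% available" claim holds depends on the window being measured.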