AWS Overheating Incident Exposes Cloud Concentration Risks in Northern Virginia

A single data center in northern Virginia went hot last week. Power failed. Servers shut down automatically. And suddenly trading apps went dark.

The Mashable report captured the sequence. A “thermal event resulting in a loss of power” struck one facility inside AWS’s use1-az4 Availability Zone in US-EAST-1. The problem began around 4:20 p.m. PDT on May 7. Within hours it impaired Elastic Compute Cloud instances and degraded Elastic Block Store volumes. Services that depend on them followed suit.

But this was no ordinary glitch. Amazon’s own status page laid it out plainly. “The issue was caused by a thermal event resulting in a loss of power,” it stated. Temperatures spiked. Cooling capacity dropped. Automated safeguards kicked in and took hardware offline. Recovery dragged into the next day. By early afternoon on May 8 cooling systems returned to pre-event levels. Most affected resources came back. A handful lingered.

Short pause. Then the ripple.

Coinbase lost trading for roughly seven hours. Users couldn’t log in properly, view balances or move assets. The cryptocurrency exchange eventually declared all markets re-enabled and its primary issue resolved. FanDuel watched sports betting grind to a halt. Gamblers vented on social media about missed opportunities and frozen cashouts. The company acknowledged technical difficulties and pointed directly at the upstream cloud failure. Chartbeat, the web analytics firm, saw its own systems falter. Even humanitarian platform KoboToolbox reported its global instance offline for hours while its European setup stayed untouched.

Amazon moved fast to contain the blast. At 5:06 p.m. PDT it began shifting traffic away from the troubled zone for most services. That decision limited wider exposure. Yet dependent offerings still felt pain. Elastic Load Balancing, Elastic Kubernetes Service, ElastiCache, Redshift, OpenSearch and Managed Streaming for Apache Kafka all logged elevated error rates and latency. Some workflows simply stopped. Provisioning times stretched. Customers needing urgent recovery received blunt advice. Launch fresh resources in other zones. Restore from EBS snapshots. Don’t wait.
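For readers who want to see what "launch in other zones" looks like in practice, here is a minimal Python sketch. It assumes zone records shaped like the entries EC2's DescribeAvailabilityZones call returns (`ZoneName`, `ZoneId`, `State`). The sample data and the `healthy_zone_names` helper are hypothetical, not anything AWS published. Filtering by zone ID matters because a zone name such as us-east-1a can map to a different physical zone in each AWS account, while an ID such as use1-az4 does not change.

```python
from typing import Iterable

IMPAIRED_ZONE_ID = "use1-az4"  # zone ID named in the incident report


def healthy_zone_names(zones: Iterable[dict], impaired_ids: set[str]) -> list[str]:
    """Return account-local zone names that are available and not impaired."""
    return sorted(
        z["ZoneName"]
        for z in zones
        if z.get("State") == "available" and z["ZoneId"] not in impaired_ids
    )


# Hypothetical sample data; the name-to-ID mapping differs per AWS account.
sample_zones = [
    {"ZoneName": "us-east-1a", "ZoneId": "use1-az4", "State": "available"},
    {"ZoneName": "us-east-1b", "ZoneId": "use1-az6", "State": "available"},
    {"ZoneName": "us-east-1c", "ZoneId": "use1-az1", "State": "impaired"},
    {"ZoneName": "us-east-1d", "ZoneId": "use1-az2", "State": "available"},
]

# Zones where replacement resources could be launched.
candidates = healthy_zone_names(sample_zones, {IMPAIRED_ZONE_ID})
print(candidates)
```

The sketch deliberately excludes the affected zone even when the API still reports it as available, since a status flag can lag behind conditions on the floor.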

The Reuters story put the episode in sharper relief. Overheating has become a recurring headache for data center operators. Advanced servers running AI models and cloud workloads generate enormous heat. They demand massive electricity. Traditional air cooling sometimes falls short. Operators now experiment with water systems and specialized coolants that can be thousands of times more efficient than air. Still, a rapid temperature spike inside one hall can cascade.

This incident marked the latest in a string. Last October a different AWS outage took down thousands of sites including Snapchat and Reddit. In November a CyrusOne cooling failure hit CME Group’s derivatives marketplace for one of its longest disruptions in years. Northern Virginia, home to the densest concentration of data centers on the planet, sits at the center of these events. US-EAST-1 handles a disproportionate share of global cloud traffic. Many architects treat it as the default region. That choice brings low latency for East Coast users. It also creates a single point of vulnerability that no amount of marketing copy can erase.

And yet progress came. AWS brought additional cooling capacity online incrementally. By late evening on May 7 early signs of recovery appeared. Racks returned to service one by one. Redshift clusters regained full function independently. Managed Streaming for Apache Kafka saw timeouts addressed through parallel fixes. The dashboard updates kept coming. Some read optimistic. Others admitted delays. “Progress is slower than originally anticipated,” one noted. Another conceded full recovery would take several hours. Transparency helped. Customers at least knew where to look.

Industry watchers reacted with familiar frustration. US-EAST-1 carries a reputation. Previous outages in 2021, 2023 and 2025 traced back to the same geography. Each time executives promise better redundancy. Each time organizations discover their failover architecture carried hidden assumptions. Multi-AZ setups protected many customers this week. Those who tied critical paths too tightly to a single zone paid the price. Coinbase’s CEO Brian Armstrong later called the outage unacceptable. He signaled a fresh review of infrastructure tradeoffs between speed and resilience. The message landed. Design choices that once looked optimal now invite second glances.
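The gap between a multi-AZ deployment and one tied to a single zone comes down to simple arithmetic. The Python sketch below shows it under stated assumptions: the fleet layout is invented for illustration and `surviving_capacity` is a hypothetical helper, not an AWS API. It computes the fraction of instances still running if one zone drops out entirely.

```python
from collections import Counter


def surviving_capacity(instance_zones: list[str], lost_zone: str) -> float:
    """Fraction of instances left if `lost_zone` goes offline entirely."""
    counts = Counter(instance_zones)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - counts[lost_zone]) / total


# Hypothetical fleets of ten instances each, keyed by zone ID.
concentrated = ["use1-az4"] * 6 + ["use1-az6"] * 2 + ["use1-az1"] * 2
balanced = ["use1-az4"] * 2 + ["use1-az6"] * 4 + ["use1-az1"] * 4

print(surviving_capacity(concentrated, "use1-az4"))  # 4 of 10 instances remain
print(surviving_capacity(balanced, "use1-az4"))      # 8 of 10 instances remain
```

A raw instance count is a rough proxy. It says nothing about stateful dependencies such as a database or volume pinned to one zone, which is the kind of hidden assumption failover plans tend to carry.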

Physical risks inside data centers receive less attention than software bugs or configuration errors. This event changed the conversation for some. Servers don’t fail gracefully when ambient temperatures exceed design thresholds. They trip offline to protect silicon. Restoration requires careful sequencing. Bring one rack up too soon and heat builds again. The process demands patience. AWS spent the better part of 24 hours executing it.

So what comes next? Operators will examine their cooling redundancy with fresh eyes. Some may accelerate adoption of liquid cooling at scale. Others will push harder for geographic diversity among customers. Cloud buyers already pay premiums for multi-region architectures. After this week those premiums may feel less like insurance and more like table stakes.

The CNBC coverage highlighted immediate business consequences. FanDuel users complained loudly on X. Betting activity froze at a bad moment for NBA playoffs. Coinbase faced questions about whether its architecture had grown too optimized for one provider’s primary region. Both companies recovered. Trading resumed. Bets settled. Yet the episode left a mark. Trust in any single cloud provider’s availability zone now carries visible fine print.

AWS itself offered no grand promises in its post-mortem. It described measured steps. Stabilized cooling. Restored the majority of instances. Continued work on the rest. That restraint carried weight. The company knows its scale. It also knows one overheated room in Virginia can touch millions of users across unrelated industries. The internet runs on a handful of physical locations no matter how virtual the marketing makes it sound.

Analysts tracking cloud concentration have warned for years. This incident supplies fresh data. When one facility loses power due to heat the effects reach far beyond its walls. Crypto exchanges. Sportsbooks. Analytics dashboards. Humanitarian databases. All felt the same thermal spike translated through shared infrastructure. The pattern suggests organizations should treat regional dependency as a strategic risk rather than an operational detail.

Recovery wrapped up late on May 8. Outage reports on Downdetector dropped from nearly 600 at peak to under 100. Most services returned to normal. A small number of lingering impaired volumes received individual attention through the AWS Health Dashboard. Support teams reached out directly to affected accounts. The system worked as designed. But the design itself showed strain.

Northern Virginia’s data center corridor continues to grow. Demand for compute shows no sign of slowing. Heat output will only increase. The question now centers on whether operators and their customers will adjust habits before the next thermal event strikes. Some already are. Others will wait for the next headline. History suggests the interval won’t be long.
