AWS Software Bug Sparks 14-Hour Outage, Disrupting Snapchat and Reddit

On October 20, 2025, a software bug in AWS's automation systems set off a 14-hour outage in the US-EAST-1 region: automated tooling deleted the IP addresses behind the DynamoDB endpoint and overwhelmed DNS, knocking out services like Snapchat, Reddit, and Ring for millions of users. Amazon attributed the failure to faulty automation, prompting code reviews and renewed calls for stronger cloud resilience.
Written by Dave Ritchie

In the early hours of October 20, 2025, a cascading failure rippled through Amazon Web Services, bringing down a swath of the internet’s most relied-upon platforms. From social media giants like Snapchat and Reddit to consumer staples such as Ring doorbells and the game Fortnite, the outage disrupted services for millions, underscoring the fragility of the cloud infrastructure that powers much of modern digital life. Amazon later attributed the chaos to a rare software bug in its automation systems, a revelation that has prompted industry experts to scrutinize the company’s operational safeguards.

The incident began innocuously enough in AWS’s US-EAST-1 region, a critical hub for many global services. According to a detailed post-mortem released by Amazon, the bug emerged during routine maintenance when automated software erroneously deleted IP addresses tied to the DynamoDB database service. This misstep prevented connections to the regional endpoint, triggering widespread connectivity issues that lasted over 14 hours.
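To make that failure mode concrete, the sketch below (not Amazon's code) shows what a dependent application would have observed once the endpoint's records vanished: a routine hostname lookup against DynamoDB's public US-EAST-1 endpoint simply stops resolving. The lookup logic and error handling are illustrative assumptions using Python's standard library.

```python
import socket

# Illustrative only: what a dependent client would observe once the regional
# endpoint's DNS records were gone. The hostname is DynamoDB's public
# US-EAST-1 endpoint; the lookup and error handling are assumptions.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def resolve_endpoint(hostname: str) -> list[str]:
    """Return the IPv4 addresses currently published for a service endpoint."""
    try:
        infos = socket.getaddrinfo(hostname, 443, socket.AF_INET, socket.SOCK_STREAM)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        # With the records deleted, every lookup lands here, and callers that
        # assume resolution always succeeds begin to fail in unison.
        raise RuntimeError(f"could not resolve {hostname}: {exc}") from exc

if __name__ == "__main__":
    print(resolve_endpoint(ENDPOINT))
```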

Unraveling the Technical Cascade

As the deletion propagated, it overwhelmed AWS’s Domain Name System (DNS) infrastructure, which struggled to handle the sudden surge in traffic rerouting requests. Engineers at Amazon described how the automation tool, designed to manage scaling and failover, instead amplified the problem by repeatedly attempting fixes that compounded the errors. Reports from The Guardian highlighted how this led to a domino effect, taking offline not just customer applications but also internal Amazon systems, including parts of its e-commerce platform.
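Amazon has not published the automation code itself, but the amplification it describes resembles a classic retry storm. The hedged sketch below contrasts naive immediate retries, which multiply load on an already failing dependency, with capped exponential backoff and jitter, which spreads that load out; both functions are illustrative stand-ins, not AWS internals.

```python
import random
import time

# Illustrative stand-ins, not AWS internals: why immediate retries amplify load
# on an already struggling dependency, versus capped exponential backoff with
# jitter, which spaces retry traffic out instead of stacking it up.

def naive_retry(operation, attempts: int = 10):
    # Every failure triggers an instant retry; thousands of clients doing this
    # at once multiply the request rate hitting the failing system.
    for _ in range(attempts):
        try:
            return operation()
        except Exception:
            continue
    raise RuntimeError("operation failed after retries")

def backoff_retry(operation, attempts: int = 10, base: float = 0.2, cap: float = 30.0):
    # Exponential backoff with full jitter: each failure waits longer, and the
    # random spread prevents synchronized retry spikes.
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise RuntimeError("operation failed after retries")
```

Backoff with full jitter is a pattern AWS's own architecture guidance has long recommended for client-side retries, which is part of why the internal amplification drew so much attention.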

The outage’s scale was staggering: Downdetector logged tens of thousands of user complaints, with report volumes peaking in major cities heavily reliant on AWS. Businesses from banking to smart-home services found themselves paralyzed, as the failure exposed deep dependencies on a single provider’s ecosystem. Amazon’s explanation, echoed in coverage by CNET, likened the event to a traffic jam where one stalled vehicle blocks an entire highway, illustrating the interconnected risks in cloud architecture.

Lessons from the Fallout

In the aftermath, Amazon deployed manual interventions to restore services, gradually rebuilding the affected IP mappings and scaling up DNS capacity to absorb the load. The company emphasized that while the bug was rare, it revealed gaps in their automation logic, prompting immediate code reviews and enhanced monitoring protocols. Insights from BBC News noted that over 1,000 companies were impacted, affecting millions of users and raising questions about redundancy in critical systems.
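The post-mortem does not spell out the exact safeguards being added, but one common guardrail for this class of failure is a blast-radius check: automation refuses to enact any plan that would remove more than a small fraction of live records, escalating to operators instead. The snippet below is a hypothetical sketch of that idea; the threshold and function names are assumptions, not Amazon's implementation.

```python
# Hypothetical guardrail sketch, not Amazon's published fix: before automation
# applies a computed change (here, a set of endpoint IPs to remove), it refuses
# any plan that would delete more than a small fraction of live records and
# escalates to a human instead. The 10% threshold is an assumption.
MAX_DELETE_FRACTION = 0.10

def apply_dns_plan(current_records: set[str], records_to_delete: set[str]) -> set[str]:
    if not current_records:
        raise RuntimeError("refusing to act on an empty record set")
    fraction = len(records_to_delete & current_records) / len(current_records)
    if fraction > MAX_DELETE_FRACTION:
        raise RuntimeError(
            f"plan would delete {fraction:.0%} of records; escalating to operators"
        )
    return current_records - records_to_delete
```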

Industry insiders are now debating the broader implications for cloud reliability. With AWS commanding a significant share of the market, this event has fueled calls for diversified hosting strategies to mitigate single points of failure. Amazon’s report, as detailed in GeekWire, admitted that while automated systems are essential for efficiency, they can introduce unforeseen vulnerabilities when not rigorously tested against edge cases.
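One diversification pattern under discussion is client-side regional failover: resolve the primary region's endpoint first and fall back to a secondary region when it cannot be reached. The sketch below assumes data is already replicated to the fallback region (for example via DynamoDB global tables); the endpoint names follow AWS's public naming pattern, but the fallback policy itself is an illustration, not an AWS recommendation.

```python
import socket

# Illustrative fallback policy, not an AWS recommendation. Endpoint names follow
# AWS's public naming pattern; the sketch assumes data is already replicated to
# the secondary region (for example via DynamoDB global tables).
REGION_ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",  # primary region
    "dynamodb.us-west-2.amazonaws.com",  # fallback region
]

def pick_reachable_endpoint(endpoints: list[str]) -> str:
    """Return the first endpoint whose hostname still resolves."""
    for host in endpoints:
        try:
            socket.getaddrinfo(host, 443)
            return host
        except socket.gaierror:
            continue  # try the next region rather than failing outright
    raise RuntimeError("no regional endpoint could be resolved")
```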

Path Forward for Cloud Resilience

Looking ahead, Amazon has committed to bolstering its automation frameworks, including more robust simulation testing to preempt such bugs. The incident echoes past outages, like the 2021 Fastly disruption, reminding providers that even sophisticated tools require human oversight. Coverage in Tom’s Guide pointed out that while services returned to normal by late October 20, the multi-hour downtime incurred substantial economic costs, estimated in the hundreds of millions for affected enterprises.
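What such simulation testing might look like in practice is suggested by the hedged example below: a unit test injects a DNS failure for the primary endpoint and verifies that fallback logic (the stand-in choose_endpoint defined in the test file, not AWS code) still returns a usable endpoint. The 203.0.113.10 address is a reserved documentation IP, which keeps the test fully offline.

```python
import socket
import unittest
from unittest import mock

# Hedged sketch of a fault-injection test, not AWS's test suite: simulate the
# primary endpoint's DNS records disappearing and verify that fallback logic
# (the stand-in choose_endpoint below) still returns a usable endpoint.

PRIMARY = "dynamodb.us-east-1.amazonaws.com"
FALLBACK = "dynamodb.us-west-2.amazonaws.com"

def choose_endpoint() -> str:
    for host in (PRIMARY, FALLBACK):
        try:
            socket.getaddrinfo(host, 443)
            return host
        except socket.gaierror:
            continue
    raise RuntimeError("no endpoint reachable")

class PrimaryRegionOutageTest(unittest.TestCase):
    def test_falls_back_when_primary_is_unresolvable(self):
        def fake_getaddrinfo(host, *args, **kwargs):
            if host == PRIMARY:
                raise socket.gaierror("simulated record deletion")
            # Pretend the fallback region still resolves; 203.0.113.10 is a
            # reserved documentation address, keeping the test offline.
            return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("203.0.113.10", 443))]

        with mock.patch("socket.getaddrinfo", side_effect=fake_getaddrinfo):
            self.assertEqual(choose_endpoint(), FALLBACK)

if __name__ == "__main__":
    unittest.main()
```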

For technology leaders, this serves as a stark reminder of the need for contingency planning in an era of hyper-dependence on cloud services. As AWS refines its processes, the episode may accelerate innovations in fault-tolerant designs, ensuring that future bugs don’t cascade into global disruptions. Ultimately, while Amazon’s transparency in disclosing the root cause—via sources like Engadget—builds trust, it also highlights the ongoing challenges of maintaining uptime in increasingly complex digital ecosystems.
