When the cloud stopped: A look inside the major AWS outage of October 20, 2025

It was a Monday that many IT managers will remember for a long time to come. On October 20, 2025, Amazon Web Services experienced a massive outage in its most important region, Northern Virginia (US-EAST-1), which not only paralyzed numerous enterprise applications but also produced absurd side effects: even owners of smart mattresses felt the consequences. Amazon has now released a detailed post-mortem report showing how a tiny race condition could trigger a chain reaction that brought a huge cloud system to a standstill.

The domino effect: Three stages of chaos

The outage began on Sunday evening at 11:48 p.m. Pacific time (8:48 a.m. Central European Summer Time on Monday morning) and dragged on in various forms until the afternoon, Pacific time. Amazon distinguishes three main phases, some of which overlapped and reinforced each other; the times in the following phase overview are given in CEST.

In the first phase, which lasted from 8:48 a.m. to 11:40 a.m., Amazon's NoSQL database DynamoDB returned massively increased error rates for API requests. That alone would have been dramatic enough, because DynamoDB is a central building block for countless AWS services. But it was about to get worse.

The second phase unfolded between 2:30 p.m. and 11:09 p.m.: the Network Load Balancer (NLB) service began to produce increased connection errors. The cause was failed health checks in the NLB fleet, which led to perfectly functional servers being taken out of rotation while the underlying problem persisted.

The third phase, and for many users perhaps the most noticeable one, affected the launch of new EC2 instances: from 11:25 a.m. to 7:36 p.m. it simply did not work. Even after instances began launching again at 7:37 p.m., they struggled with connectivity problems until 10:50 p.m.

The root of all evil: A treacherous race condition

So what happened? Amazon calls it a "latent defect" in DynamoDB's automated DNS management system. That sounds harmless at first, but it had severe consequences. To understand what went wrong, you have to dive a little into the architecture.

Services like DynamoDB manage hundreds of thousands of DNS records to operate their vast, heterogeneous load balancer fleets in each region. The DNS system enables seamless scaling, fault isolation, low latency and local access. Automation is essential for adding capacity, handling hardware failures and distributing traffic efficiently.

Amazon's DNS management system for DynamoDB is divided into two independent components for availability reasons. The DNS Planner monitors the health and capacity of the load balancers and periodically creates new DNS plans for each endpoint of the service. These plans consist of a collection of load balancers with corresponding weights. The DNS Enactor, on the other hand, has minimal dependencies and implements these DNS plans by making the necessary changes in Amazon Route 53. For resilience, the DNS Enactor runs redundantly and completely independently in three different Availability Zones.
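
To make this division of labour a little more tangible, here is a minimal sketch in Python of what such a DNS plan could look like. All names, fields and the weighting scheme are my own assumptions for illustration; Amazon has not published its actual data model.

```python
from dataclasses import dataclass, field
import time

@dataclass(frozen=True)
class DnsPlan:
    """One generation of a DNS plan for a service endpoint (illustrative only)."""
    generation: int            # monotonically increasing plan version
    endpoint: str              # e.g. "dynamodb.us-east-1.amazonaws.com"
    weighted_targets: dict     # load balancer IP -> routing weight
    created_at: float = field(default_factory=time.time)

def plan_from_fleet_state(generation: int, endpoint: str, healthy_lbs: dict) -> DnsPlan:
    """What a DNS Planner conceptually does: turn the current health and
    capacity view of the load balancer fleet into a new plan generation."""
    total = sum(healthy_lbs.values()) or 1
    weights = {ip: round(255 * capacity / total) for ip, capacity in healthy_lbs.items()}
    return DnsPlan(generation=generation, endpoint=endpoint, weighted_targets=weights)

# The Planner runs periodically and keeps emitting newer generations.
plan_v1 = plan_from_fleet_state(1, "dynamodb.us-east-1.amazonaws.com",
                                {"10.0.0.1": 100, "10.0.0.2": 80})
plan_v2 = plan_from_fleet_state(2, "dynamodb.us-east-1.amazonaws.com",
                                {"10.0.0.1": 100, "10.0.0.2": 80, "10.0.0.3": 120})
```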

Each of these independent DNS Enactor instances looks for new plans and tries to update Route 53 by replacing the current plan with the new one via a Route 53 transaction. This ensures that each endpoint is updated with a consistent plan, even if multiple DNS Enactors attempt updates at the same time.

And here lies the problem: the race condition arose from an unlikely interaction between two DNS Enactors. Normally, a DNS Enactor takes the latest plan and works through the service endpoints to apply it. Before applying a new plan, it checks once whether its plan is newer than the previously applied one. As it works through the list of endpoints, delays can occur when another DNS Enactor happens to be updating the same endpoint. In such cases, the DNS Enactor retries each endpoint until the plan has been successfully applied everywhere.
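
A simplified sketch of this apply loop, again with hypothetical names, makes the weak spot visible: the "is my plan newer?" check happens exactly once, before a retry loop that can drag on for a long time.

```python
import time

class ConcurrentUpdateError(Exception):
    """Another Enactor is updating the same endpoint right now (illustrative)."""

def apply_plan(route53, plan, endpoints, retry_delay=1.0):
    """Illustrative DNS Enactor loop, not Amazon's code.

    `route53` and `plan` are hypothetical stand-ins: the plan carries a
    monotonically increasing generation, and `route53` exposes the generation
    that is currently live plus a transactional record swap per endpoint.
    """
    # The freshness check happens ONCE, up front. If the retry loop below
    # drags on long enough, this answer is stale by the time the last
    # endpoints are written; exactly the latent defect described above.
    if plan.generation <= route53.live_generation():
        return  # a newer (or identical) plan is already in place

    pending = list(endpoints)
    while pending:
        endpoint = pending.pop(0)
        try:
            # Conceptually a Route 53 transaction: swap the old record set
            # for the new one in a single consistent update.
            route53.replace_records(endpoint, plan)
        except ConcurrentUpdateError:
            pending.append(endpoint)   # try this endpoint again later
            time.sleep(retry_delay)
```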

Shortly before the outage, one DNS Enactor experienced unusually high delays and had to retry its updates at several DNS endpoints. While it slowly worked its way through the endpoints, several other things happened in parallel: the DNS Planner kept running and produced many newer generations of plans, and another DNS Enactor began applying one of these newer plans, getting through all the endpoints quickly.

The timing of these events triggered the latent race condition. When the second Enactor (the one applying the newest plan) completed its endpoint updates, it started the plan cleanup process, which identifies plans significantly older than the one just applied and deletes them. At exactly this moment, the first, unusually delayed Enactor applied its much older plan to the regional DynamoDB endpoint, overwriting the newer plan. Its check at the beginning of the plan application process had long since gone stale because of the unusual delays and no longer prevented the older plan from overwriting the newer one.

The second Enactor's cleanup process then deleted this older plan, because it was many generations older than the plan it had just applied. Deleting the now-active plan immediately removed all IP addresses for the regional endpoint and, on top of that, left the system in an inconsistent state that prevented any DNS Enactor from applying subsequent plan updates. This situation ultimately required manual intervention by operators.
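
Replayed as a toy timeline, with invented generation numbers and a plain dictionary standing in for Route 53, the interleaving looks roughly like this:

```python
# A deliberately tiny, single-threaded replay of the interleaving. Generation
# numbers are invented; a plain dict stands in for the stored Route 53 plans.
route53_plans = {97: ["10.0.0.1", "10.0.0.2"]}   # stored plan generations
live_generation = 97                              # plan the endpoint currently serves

# t0: the slow Enactor A passes its one-time check (its plan 100 > live 97), then stalls.
# t1: the Planner keeps producing newer plans; the fast Enactor B applies generation 140.
route53_plans[140] = ["10.0.0.3", "10.0.0.4", "10.0.0.5"]
live_generation = 140

# t2: Enactor B finishes and starts cleaning up plans much older than 140.
# t3: Enactor A, still trusting its stale check from t0, now writes generation 100.
route53_plans[100] = ["10.0.0.1", "10.0.0.2"]
live_generation = 100                             # the older plan wins

# t4: B's cleanup classifies generation 100 as ancient and deletes it.
del route53_plans[100]

print(route53_plans.get(live_generation, []))     # [] -> the endpoint resolves to nothing
```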

The result? The DNS record for ‘dynamodb.us-east-1.amazonaws.com’ was suddenly empty. Every system that tried to connect to DynamoDB in Northern Virginia immediately ran into DNS errors. This affected customer traffic as well as traffic from internal AWS services that rely on DynamoDB.
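
From the outside, all of this looked like an ordinary name-resolution failure. As a rough client-side illustration of what affected applications ran into (under normal conditions this lookup succeeds, of course):

```python
import socket

try:
    # Under normal conditions this resolves to a list of healthy load balancer IPs.
    # During the event the record was empty, and lookups failed.
    socket.getaddrinfo("dynamodb.us-east-1.amazonaws.com", 443)
except socket.gaierror as exc:
    # Applications saw DNS errors of this kind and could not open a single
    # connection to the regional DynamoDB endpoint.
    print(f"DNS resolution failed: {exc}")
```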

The consequences: When the infrastructure fails

At 9:38 a.m., engineers identified the fault in the DNS management system. Initial temporary mitigations followed at 10:15 a.m. and allowed some internal services to reconnect to DynamoDB, which was important to unblock critical internal tools needed for the further recovery. By 11:25 a.m., all DNS information had been restored.

But the crisis was far from over: new EC2 instances still refused to launch. The reason was the DropletWorkflow Manager (DWFM), which manages all the underlying physical servers on which EC2 hosts its instances. Internally, Amazon calls these servers ‘droplets’.

Each DWFM manages a number of droplets within an Availability Zone and maintains a lease for each droplet currently under its management. This lease allows the DWFM to track droplet state and ensure that all actions from the EC2 API, or from within an EC2 instance itself, such as shutdown or reboot operations initiated by the instance's operating system, result in the correct state changes in the wider EC2 systems. As part of this lease management, each DWFM host must check in with every droplet it manages every few minutes and perform a state check.

However, this process depends on DynamoDB. When DynamoDB was unavailable, these state checks began to fail. This did not affect running EC2 instances, but it meant that a droplet had to establish a new lease with a DWFM before any further instance state changes could take place for the EC2 instances it hosts. Between 11:48 p.m. and 2:24 a.m. Pacific time, the leases between DWFM and the droplets in the EC2 fleet slowly began to expire.
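
A rough sketch of the lease idea, with invented field names and an invented TTL: the periodic check-in persists state via DynamoDB, so when DynamoDB is unreachable, the lease eventually expires and the droplet drops out of the pool of launch candidates.

```python
import time
from dataclasses import dataclass

LEASE_TTL_SECONDS = 300    # illustrative; the report only says checks happen "every few minutes"

@dataclass
class DropletLease:
    droplet_id: str
    renewed_at: float

    def expired(self, now: float) -> bool:
        return now - self.renewed_at > LEASE_TTL_SECONDS

def renew_lease(lease: DropletLease, dynamodb_put) -> DropletLease:
    """Conceptual DWFM check-in. `dynamodb_put` stands in for the state write
    that depends on DynamoDB; if it keeps failing, the lease is never refreshed,
    eventually expires, and the droplet stops being a launch candidate."""
    dynamodb_put({"droplet_id": lease.droplet_id, "checked_in_at": time.time()})
    return DropletLease(lease.droplet_id, renewed_at=time.time())

def eligible_for_new_launches(lease: DropletLease, now: float) -> bool:
    return not lease.expired(now)
```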

When DynamoDB became available again at 2:25 a.m. Pacific time (11:25 a.m. CEST), DWFM began re-establishing leases with droplets across the EC2 fleet. Since a droplet without an active lease is not considered a candidate for new EC2 launches, the EC2 APIs returned ‘insufficient capacity’ errors for incoming launch requests.

And here a particularly insidious problem emerged: because of the sheer number of droplets, the attempts to establish new droplet leases took so long that they timed out before the work could complete. The timed-out work was then queued again to retry the lease establishment. At this point, DWFM had entered a state of congestive collapse and could no longer make progress in recovering droplet leases.
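
The collapse follows a simple, brutal arithmetic: if re-establishing a lease takes longer than the work item's timeout, the item is re-queued before it can ever finish, and the queue only grows. A toy model with made-up numbers:

```python
# Toy model of the congestive collapse; all numbers are invented for illustration.
work_timeout_s = 30        # a lease-recovery attempt is re-queued after this long
time_per_attempt_s = 45    # time an attempt actually needs under the huge backlog
queued_droplets = 100_000  # droplets waiting for a fresh lease

# Every attempt exceeds its timeout, so it is cancelled and queued again before it
# can finish; the queue never shrinks and the retries only add more load on top.
recovered = queued_droplets if time_per_attempt_s <= work_timeout_s else 0
print(f"Leases recovered per pass: {recovered}")   # 0, i.e. no forward progress
```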

Since there was no established operational recovery procedure for this situation, the engineers proceeded cautiously so as to resolve the DWFM problem without causing further damage. After several mitigation attempts, they throttled incoming work at 4:14 a.m. Pacific time and began selectively restarting DWFM hosts. Restarting the hosts cleared the DWFM queues, reduced processing times and allowed droplet leases to be established. By 5:28 a.m., DWFM had established leases with all droplets in the Northern Virginia region, and new launches began to succeed again, although many requests still saw ‘request limit exceeded’ errors because of the request throttling that had been put in place.

The Network Manager: When the network lags behind

But even then, the problems were not over. When a new EC2 instance is launched, a system called Network Manager propagates the network configuration that allows the instance to communicate with other instances within the same Virtual Private Cloud (VPC), other VPC network devices, and the Internet.

At 5:28 a.m. Pacific time (2:28 p.m. CEST), shortly after DWFM had recovered, the Network Manager began propagating updated network configurations to newly launched instances and to instances that had been terminated during the event. Because these network propagation events had been delayed by the DWFM issue, the Network Manager in the Northern Virginia region had to work through a significant backlog of network state propagations.

As a result, at 6:21 a.m. the Network Manager began to show increased latencies in network propagation times while it processed the backlog of network state changes. New EC2 instances could be launched successfully, but they lacked the necessary network connectivity because of these propagation delays. The engineers worked to reduce the load on the Network Manager and took further steps to speed up recovery. By 10:36 a.m., network configuration propagation times had returned to normal, and new EC2 instance launches behaved normally again.

Network Load Balancer: When the health check system gets sick

Delays in network state propagation for newly launched EC2 instances also impacted the Network Load Balancer (NLB) service and the AWS services that use NLB. Between 5:30 a.m. and 2:09 p.m. Pacific time on October 20, some customers experienced increased connection errors on their NLBs in the Northern Virginia region.

NLB is built on a highly scalable, multi-tenant architecture that provides load balancing endpoints and routes traffic to backend destinations that are typically EC2 instances. The architecture also uses a separate health check subsystem that regularly performs health checks against all nodes within the NLB architecture and removes any nodes from the service that are considered unhealthy.

During the event, the NLB health check subsystem began to see increased health check failures. The cause: the subsystem was bringing newly launched EC2 instances into service while the network state for these instances had not yet fully propagated. As a result, health checks sometimes failed even though the underlying NLB node and its backend targets were healthy. Health check results flipped back and forth between failed and healthy, so NLB nodes and backend targets were removed from DNS only to be put back into service at the next successful health check.

These alternating health check results increased the load on the health check subsystem and degraded it, causing delays in health checks and triggering automatic AZ DNS failovers. For multi-AZ load balancers, this meant that capacity was taken out of service. An application then saw increased connection errors whenever the remaining healthy capacity was insufficient to carry its load.
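
Sketched as code, with hypothetical names and simplified down to a single decision, the feedback loop looks roughly like this:

```python
def evaluate_nlb_node(node: dict, network_state_propagated: bool, dns) -> bool:
    """Illustrative health-check decision for one NLB node; names are invented.

    During the event, `network_state_propagated` was still False for freshly
    launched nodes even though the node and its backend targets were fine, so
    the probe failed, the node was pulled from DNS, and it was re-added at the
    next successful probe: the flapping that overloaded the subsystem.
    """
    probe_succeeded = network_state_propagated and node["healthy"]
    if probe_succeeded:
        dns.add(node["ip"])       # back into rotation
    else:
        dns.remove(node["ip"])    # taken out, despite being healthy
    return probe_succeeded
```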

At 9:36 a.m., the engineers disabled automatic health check failover for NLB, allowing all available healthy NLB nodes and backend targets to be brought back into service. This resolved the increased connection errors on the affected load balancers. Shortly after EC2 had recovered, they re-enabled automatic DNS health check failover at 2:09 p.m. Pacific time.

The impact: A long trail of problems

The DynamoDB outage and the issues that followed had far-reaching consequences for many other AWS services. Lambda functions returned API errors and elevated latencies between 11:51 p.m. on October 19 and 2:15 p.m. Pacific time on October 20. Initially, the DynamoDB endpoint issues prevented the creation and updating of functions, caused processing delays for SQS and Kinesis event sources, and led to invocation errors.

Amazon Elastic Container Service (ECS), Elastic Kubernetes Service (EKS), and Fargate experienced container launch errors and cluster scaling delays between 11:45 p.m. on October 19 and 2:20 p.m. Pacific time on October 20.

Amazon Connect customers experienced increased errors handling calls, chats, and cases between 11:56 p.m. on October 19 and 1:20 p.m. Pacific time on October 20. Incoming callers heard busy signals or error messages, or their connections failed entirely. Both agent-initiated and API-initiated outbound calls failed as well.

The AWS Security Token Service (STS) experienced API errors and latency between 11:51 p.m. and 9:59 a.m. Customers attempting to log in to the AWS Management Console with an IAM user experienced increased authentication errors due to the underlying DynamoDB issues between 11:51 p.m. on October 19 and 1:25 a.m. Pacific time on October 20.

Amazon Redshift customers saw API errors when creating or modifying Redshift clusters and when running queries against existing clusters between 11:47 p.m. on October 19 and 2:21 a.m. Pacific time on October 20. Interestingly, Redshift customers in all AWS regions were unable to use IAM user credentials to run queries between 11:47 p.m. on October 19 and 1:20 a.m. on October 20, because a defect caused Redshift to call an IAM API in the Northern Virginia region to resolve user groups.

The lessons: What Amazon is changing

Amazon has already taken several measures and plans further changes to prevent a recurrence. The DynamoDB DNS Planner and DNS Enactor automation has been disabled worldwide. Before re-enabling it, Amazon will fix the race condition scenario and add additional safeguards to prevent incorrect DNS plans from being applied.

For the Network Load Balancer, a velocity control mechanism is being added that limits how much capacity a single NLB can remove when health check failures trigger an AZ failover. This is intended to prevent too much capacity from being taken out of service at once.
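
In spirit, such a velocity control could look like the following sketch; the 20 percent cap and the names are my assumptions, not Amazon's published design.

```python
MAX_REMOVAL_FRACTION = 0.2    # assumed cap; Amazon has not published the actual limit

def nodes_allowed_to_remove(total_nodes: int, failing_nodes: int) -> int:
    """Cap how much NLB capacity an AZ failover may take out of service, so a
    flapping health-check subsystem cannot drain most of the fleet at once."""
    cap = int(total_nodes * MAX_REMOVAL_FRACTION)
    return min(failing_nodes, cap)

# Example: 50 of 100 nodes look unhealthy, but at most 20 may actually be removed.
print(nodes_allowed_to_remove(100, 50))   # -> 20
```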

For EC2, Amazon is building an additional test suite to complement its existing scaling tests; it will exercise the DWFM recovery workflow to catch future regressions. In addition, the throttling mechanism in EC2's data propagation systems is being improved so that incoming work is limited based on the size of the queue, protecting the service during periods of high load.
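
The improved throttling for the data propagation systems amounts to admission control based on queue depth. A minimal sketch with an invented limit:

```python
from collections import deque

MAX_QUEUE_DEPTH = 10_000    # invented limit for illustration

class ThrottledQueue:
    """Admission control by queue depth: reject new work once the backlog is too
    deep, so the service keeps finishing what it has already accepted instead of
    sliding into congestive collapse."""

    def __init__(self):
        self._items = deque()

    def submit(self, item) -> bool:
        if len(self._items) >= MAX_QUEUE_DEPTH:
            return False          # caller sees a throttle / "slow down" response
        self._items.append(item)
        return True

    def take(self):
        return self._items.popleft() if self._items else None
```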

Conclusion: When redundancy is not enough

This outage is an impressive demonstration of how complex modern cloud infrastructures are, and of how a seemingly small weakness, a race condition unlikely to occur under normal circumstances, can lead to a cascade of failures. Despite all the redundancy, despite DNS Enactors deployed three times across different Availability Zones, despite sophisticated automation, an unlucky piece of timing was enough to bring the whole system down.

In its post-mortem, Amazon apologizes for the impact on its customers, emphasizing how critical its services are to customers, their applications, end users and their businesses, and promising to do everything it can to learn from this event and improve availability even further.

For us as users, one insight remains: the cloud may be robust, but it is not infallible. Multi-region strategies, disaster recovery plans and fallback mechanisms are not paranoia, they are a necessity. And sometimes it takes a smart mattress to remind us how dependent we have become on this invisible infrastructure.