Or: Why 'unintended configuration changes' should become the new buzzword of the year
Did you think that after the major AWS disaster of October 20 (and the detailed technical review) we would have learned our lesson? That would be nice! Just nine days later, yesterday, October 29, 2025, Microsoft proved that you can paralyze the internet even without DNS race conditions.
All it takes is an ‘unintentional configuration change’ in Azure Front Door, and the digital world grinds to a halt. Welcome to the second act of October's cloud chaos!
Timeline: An Afternoon in Digital Chaos
17:00 CET – It starts
Around 16:00 UTC (5:00 p.m. our time) the first reports came in: Microsoft services were no longer responding, or only very sluggishly. What initially looked like a small hiccup quickly turned out to be a full-blown outage.
17:06 CET - Microsoft detects the problem
Microsoft published the first official incident message in the admin center under incident ID MO1181369. The list of affected services reads like a best-of of the Microsoft cloud:
- Exchange Online (bye bye, emails!)
- Microsoft 365 suite (Excel, Word, PowerPoint in a coma)
- Microsoft Defender XDR (Security? What security?)
- Microsoft Entra (formerly Azure AD – authentication down!)
- Microsoft Intune (device management, goodbye)
- Microsoft Purview (compliance nightmare)
- Power Apps (all your custom apps: dead)
Particularly spicy: the Microsoft 365 admin center itself was also affected. It's as if the fire station were on fire while blazes break out all over town. Admins could do little more than watch helplessly.
17:21 CET – The first analysis
Microsoft announced: "We are investigating reports of a problem affecting Microsoft Azure and Microsoft 365 services." Don't panic, everything's under control! (Spoiler: it wasn't.)
17:28 CET – DNS strikes again!
And there it was again, the old admin trauma: ‘It's always DNS!’ Microsoft confirmed that DNS problems were the cause. Specifically, the network and hosting infrastructure was in an ‘unhealthy state’.
For the non-technicians among you: DNS (Domain Name System) is the internet's phone book. If it doesn't work, computers can no longer talk to each other because they can't find each other. No DNS, no internet. It's that simple.
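If you want to see this for yourself: a minimal Python sketch (the hostnames are just examples) shows what a successful lookup returns and how a failed one surfaces to a program.

```python
import socket

def lookup(hostname: str) -> None:
    """Resolve a hostname to IP addresses, or report that DNS gave no usable answer."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        addresses = sorted({info[4][0] for info in infos})
        print(f"{hostname} -> {', '.join(addresses)}")
    except socket.gaierror as err:
        # Roughly what clients experienced on October 29: no usable answer comes
        # back for the name, so no connection can be made at all.
        print(f"{hostname} -> DNS lookup failed ({err})")

# outlook.office365.com sits behind Microsoft's edge network; the second name
# is guaranteed not to resolve and demonstrates the failure path.
for host in ("outlook.office365.com", "does-not-exist.invalid"):
    lookup(host)
```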
17:36 CET – Traffic is diverted
Microsoft tried to redirect traffic to alternative, healthy infrastructure. It's like trying to divert all cars onto dirt roads when there's a traffic jam on the highway. Sounds good in theory...
18:17 CET - The cause is found
Now it became concrete: “We’ve identified a recent configuration change to a portion of Azure infrastructure which we believe is causing the impact.”
An "unintentional configuration change" – this is cloud talk for: “Someone pushed a wrong button somewhere.” The problem was specifically related to: Azure front door, Microsoft's Content Delivery Network (CDN).
18:24 UTC – Rollback is initiated
Microsoft started deploying the ‘last known good configuration’, i.e. rolling back to the last working settings. Estimated duration: 30 minutes. (Spoiler no. 2: it took a lot longer.)
At the same time, Microsoft temporarily blocked all customer configuration changes to avoid further chaos. Imagine trying to put out a burning house while people keep dragging new furniture in.
19:57 UTC - First signs of improvement
The rollback was complete, and Microsoft began restoring nodes and routing traffic through healthy ones. Expected full recovery: by 23:20 UTC (00:20 CET). Another four hours of waiting.
Shortly after 02:00 CET (October 30) – all clear
After more than eight hours of outage, Microsoft had fixed the problem. Eight hours! Half an eternity in the digital world.
What is Azure Front Door?
Before we dive deeper, a brief explanation for those who don't have to deal with cloud infrastructure every day:
Azure Front Door is Microsoft's global content delivery network (CDN) and application delivery network (ADN). Simply put: it is the ‘entrance door’ for virtually all Azure and Microsoft 365 services worldwide.
Front Door performs several critical tasks:
- Load balancing: distributes incoming traffic across different servers
- Caching: stores frequently requested content so it loads faster
- DDoS protection: filters out attacks and bots
- SSL termination: decrypts encrypted connections
- Routing: directs requests to the geographically closest or least busy servers
If Front Door fails, it is as if the main gate of a huge building complex were barred – nobody gets in, no matter how important their business is.
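To make that division of labour a bit more tangible, here is a deliberately simplified Python sketch of what such an ‘entrance door’ does with each request. This is purely conceptual – not Microsoft's actual implementation – and all names, regions and pools are invented.

```python
from dataclasses import dataclass
import random

@dataclass
class Origin:
    name: str
    region: str
    healthy: bool

# Invented backend pool; in reality this spans hundreds of edge sites worldwide.
ORIGINS = [
    Origin("weu-1", "westeurope", healthy=True),
    Origin("neu-1", "northeurope", healthy=True),
    Origin("use-1", "eastus", healthy=False),   # unhealthy nodes are skipped
]

CACHE: dict[str, bytes] = {}

def handle_request(path: str, client_region: str) -> str:
    """Very simplified edge logic: serve from cache, else route to a healthy, nearby origin."""
    if path in CACHE:                                   # caching
        return f"cache hit for {path}"
    candidates = [o for o in ORIGINS if o.healthy]      # health-aware load balancing
    if not candidates:
        return "503 - no healthy origin (this is what a Front Door outage feels like)"
    nearby = [o for o in candidates if o.region == client_region] or candidates
    chosen = random.choice(nearby)                      # routing to a nearby origin
    CACHE[path] = b"..."                                # fill the cache for next time
    return f"forwarded {path} to {chosen.name} ({chosen.region})"

print(handle_request("/owa/", "westeurope"))   # first hit goes to an origin
print(handle_request("/owa/", "westeurope"))   # second hit is served from cache
```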
The technical dimension: What exactly happened?
From the official status messages and media reports, the following scenario can be reconstructed:
Phase 1: The fatal configuration change
Sometime before 16:00 UTC, a configuration change was made to the Azure Front Door infrastructure. Microsoft calls it ‘inadvertent’ – which probably means one of the following:
- An automated process made a faulty change
- A manual change had unexpected side effects
- A deployment went wrong
This change caused DNS problems. Specifically: the DNS records that tell clients where to find the Azure services were suddenly incorrect, incomplete, or missing altogether.
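How do such broken records look from the outside? A small sketch using the dnspython library that distinguishes exactly these three failure modes – the custom domain is hypothetical, and azurefd.net is used here as the Front Door target you would expect.

```python
import dns.resolver   # pip install dnspython
import dns.exception

def classify(name: str) -> str:
    """Distinguish the three failure modes: missing, incomplete, or incorrect record."""
    try:
        answer = dns.resolver.resolve(name, "CNAME")
    except dns.resolver.NXDOMAIN:
        return f"{name}: record no longer present"
    except dns.resolver.NoAnswer:
        return f"{name}: record exists but is incomplete (no CNAME answer)"
    except dns.exception.DNSException as err:
        return f"{name}: resolvers gave no usable answer ({err})"
    targets = [rdata.target.to_text() for rdata in answer]
    # 'Incorrect' can only be judged against what you expect the record to contain:
    if not any(t.endswith("azurefd.net.") for t in targets):
        return f"{name}: record present but points somewhere unexpected: {targets}"
    return f"{name}: looks healthy -> {targets}"

# Hypothetical custom domain that is supposed to be fronted by Azure Front Door:
print(classify("www.contoso.example"))
```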
Phase 2: The Cascade Begins
Because Front Door acts as a central component, a chain reaction began:
- Primary services hit: Outlook, Microsoft 365 and Exchange Online were directly affected
- Admin tools down: the Microsoft 365 admin center and the Azure Portal were partly unavailable – exactly the tools admins need to fix problems
- Authentication failing: Microsoft Entra (Azure AD) had problems, so many users could not log in at all
- Security tools down: Microsoft Defender XDR and Microsoft Purview were affected – security and compliance were effectively blind
Phase 3: Trying to save the portal
Microsoft took an interesting step: they ‘failed the portal away from AFD’, i.e. rerouted the Azure Portal so that it bypassed Front Door and was reachable directly. This partly worked, but some portal extensions (like the Marketplace) remained problematic.
This is like leaning an emergency ladder against a burning building – it works, but only to a limited extent.
Phase 4: The rollback marathon
Rolling back to the last working configuration took hours. Why so long? Because Azure Front Door is distributed globally and the changes had to be propagated across hundreds of servers in dozens of data centers worldwide.
During the rollback, the engineers had to work through the following, roughly as in the sketch after this list:
- Identify the ‘last known good’ configuration
- Deploy this configuration (30+ minutes)
- Restore nodes piece by piece
- Gradually route traffic through healthy nodes
- Monitor that nothing else breaks
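A hedged sketch of why this takes hours: the ‘last known good’ configuration is pushed out in small batches, and a node only goes back into rotation once it passes its health checks again. Everything here – node names, batch size, timings – is invented for illustration.

```python
import time

# Invented example data: a tiny fraction of a global edge fleet.
NODES = [f"edge-{region}-{i}" for region in ("eu", "us", "asia") for i in range(3)]
LAST_KNOWN_GOOD = {"routes": "2025-10-28T23:00Z"}   # hypothetical config snapshot

def deploy(node: str, config: dict) -> None:
    print(f"deploying config {config['routes']} to {node}")
    time.sleep(0.1)   # stands in for minutes of propagation per node

def is_healthy(node: str) -> bool:
    # In reality: synthetic probes, error rates and DNS answers checked per node.
    return True

def rollback(nodes: list[str], config: dict, batch_size: int = 2) -> None:
    """Roll out the last known good configuration batch by batch."""
    in_rotation: list[str] = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            deploy(node, config)
        # Only route traffic through nodes that pass their health checks again.
        in_rotation += [n for n in batch if is_healthy(n)]
        print(f"{len(in_rotation)}/{len(nodes)} nodes back in rotation")

rollback(NODES, LAST_KNOWN_GOOD)
```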
Collateral damage: Who was affected?
Airlines in chaos
Alaska Airlines and Hawaiian Airlines reported that they had no access to critical systems because of the Azure problems. The airlines' websites were down and online check-in did not work. Passengers had to wait in long queues at the airport and be checked in manually.
Imagine: You're at the airport, your flight leaves in an hour, and suddenly all passengers have to be checked in manually because the cloud isn't working. Welcome to the 1990s!
Retail and Gastronomy
In the U.S., several major chains reported problems:
- Kroger (supermarket chain)
- Costco (wholesaler)
- Starbucks (coffee house chain)
At Starbucks this meant: the mobile app didn't work, mobile payment was dead, and the staff had to fall back on old manual systems.
Gaming and Entertainment
- Xbox Live: Players could not log in, multiplayer games could not be reached
- Minecraft: Again! After the AWS outage, now the Azure outage. The Minecraft community had a black October.
Business-critical services
The outage was particularly painful for professional users:
CodeTwo (email signature management) reported global performance issues in several regions:
- Germany West Central
- Australia East
- Canada East
- And 13 other components
SpeechLive (cloud dictation solution for lawyers and doctors) was completely down. Imagine being a doctor who urgently needs to dictate patient notes while the cloud software is on strike. Not a good situation.
TeamViewer (web.teamviewer.com) was affected – remote support became a challenge.
The German Perspective
In Germany, too, there were effects that went beyond the direct Microsoft services:
- Various ISPs (1&1, Vodafone Cable) reported increased fault messages – presumably because many users thought their internet was broken even though it was ‘only’ the cloud
- Some users reported that even pages not hosted on Azure were loading slower - an indication of how far the DNS issues went
- The blog BornCity.com had short-term outages despite being hosted at all-inkl.com – possibly due to DNS propagation issues
AWS: The same game nine days earlier
Let's look back at October 20, 2025. At 9:30 a.m. German time, the great trembling began: AWS, the world's largest cloud provider, ran into massive problems in its US-EAST-1 region. And because this region is so central, practically half the internet went down.
The domino effect
The list of affected services reads like a who’s who of the internet:
- Signal, Snapchat, Zoom, Slack
- Fortnite, Roblox, Minecraft (Yes, again)
- Tinder (No date for you!)
- Amazon Prime Video, Alexa
- Coinbase, Robinhood, Venmo
- Perplexity AI, Canva, Duolingo
- Autodesk (local installations stopped working because the license servers were unreachable)
- In Germany: gematik reported disruptions in the Telematics Infrastructure (TI) affecting the e-prescription (eRezept) and the electronic patient record (ePA), because health insurers rely on AWS
Around 8.1 million outage reports came in, and more than 2,000 websites and apps were affected. Even "Eight Sleep", a smart bed system that automatically adjusts temperature and incline, stopped working. People couldn't even sleep comfortably anymore!
The technical cause: A race condition
What was the cause? A so-called race condition in AWS's DNS automation. Two automated processes tried to apply changes at the same time, and – poof! – the affected DNS records ended up empty. The servers suddenly no longer knew how to reach each other.
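To be clear: this is not AWS's actual code. But the pattern is a textbook check-then-act race, and a few lines of Python reproduce how a delayed cleanup job can empty a record that a parallel job has just refreshed (the endpoint name and addresses are made up).

```python
import threading
import time

# Hypothetical shared DNS state for one endpoint (all names and IPs invented).
dns_records = {"dynamodb.example-region.api": ["10.0.0.1", "10.0.0.2"]}

def apply_fresh_plan() -> None:
    """Automation A: promptly installs the new, correct address list."""
    dns_records["dynamodb.example-region.api"] = ["10.0.1.1", "10.0.1.2"]

def cleanup_old_plan() -> None:
    """Automation B: started around the same time, but working from an old snapshot.

    It checks first and acts later (check-then-act): by the time it deletes what it
    believes is the stale 10.0.0.x plan, A has already written the new one - and
    the endpoint ends up with no addresses at all.
    """
    stale_view = dict(dns_records)      # check: still sees the old 10.0.0.x plan
    time.sleep(0.05)                    # A runs inside this gap
    if stale_view["dynamodb.example-region.api"][0].startswith("10.0.0."):
        dns_records["dynamodb.example-region.api"] = []   # act on the stale check

cleaner = threading.Thread(target=cleanup_old_plan)
updater = threading.Thread(target=apply_fresh_plan)
cleaner.start()
time.sleep(0.01)
updater.start()
cleaner.join(); updater.join()

print(dns_records)   # {'dynamodb.example-region.api': []} -> nobody can resolve it.
                     # The cure: locking or versioned writes, so B notices its view is stale.
```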
The primarily affected service was DynamoDB, a database service that AWS itself also uses internally. When DynamoDB went down, a cascade followed: EC2 (virtual servers) and Lambda (serverless code) were hit as well. A classic single point of failure.
AWS took about three hours to find and fix the cause. But the after-effects were felt for hours afterwards.
The big picture: Cloud dependency as a risk
Two massive failures within nine days. Both times the same root cause: DNS problems in central cloud infrastructures. What can we learn from this?
1. Single Point of Failure is Real
No matter how big and powerful a cloud provider is: if it fails, half the internet often goes down with it. AWS and Azure are so dominant that their outages have a global impact. Together, AWS, Microsoft Azure and Google Cloud control about 65% of the global cloud market. That is an enormous concentration of power.
2. Multi-cloud is not a luxury but a duty
Experts have long warned: anyone who relies on a single cloud provider for all their services takes a huge risk. Multi-cloud strategies, in which you spread your infrastructure across several providers, are essential today. Yes, that is more complex and more expensive – but an eight-hour outage can cost you far more.
3. Failover strategies are a must
Do you have a plan B? And a plan C? Businesses need:
- Automatic failover systems that switch to alternative infrastructure during outages (a minimal sketch follows after this list)
- Redundant backups on different platforms
- CDNs with multiple origins, so content can be delivered from several sources
- Regular tests of your emergency plans (not only when things are already on fire!)
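The first point in the list above does not have to be rocket science. A minimal sketch of client-side failover in Python – the health-check URLs are placeholders for endpoints you would host with different providers.

```python
import urllib.error
import urllib.request

# Placeholder endpoints for the same service, hosted with different providers.
ENDPOINTS = [
    "https://app.primary-cloud.example/healthz",
    "https://app.secondary-cloud.example/healthz",
    "https://app.on-prem.example/healthz",
]

def first_healthy(endpoints: list[str], timeout: float = 2.0) -> str | None:
    """Return the first endpoint that answers its health check, or None if all fail."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, OSError):
            continue   # endpoint down or unreachable -> try the next provider
    return None

target = first_healthy(ENDPOINTS)
print(target or "all endpoints down - time for the offline emergency plan")
```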
4. DNS remains the Achilles' heel
Both outages had DNS problems at their core. The Domain Name System is the nervous system of the internet – if it fails, chaos is guaranteed. Companies should:
- Use distributed DNS strategies
- Use multiple DNS providers (see the sketch after this list)
- Configure DNS caching intelligently
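The ‘multiple DNS providers’ point could, for example, look like this – a sketch using the dnspython library (assuming dnspython 2.x; the resolver addresses are the public Cloudflare, Google and Quad9 services, swap in your own).

```python
import dns.exception
import dns.resolver   # pip install dnspython

# Independent public resolver pools; in production you would add your own servers.
RESOLVER_POOLS = [
    ["1.1.1.1", "1.0.0.1"],          # Cloudflare
    ["8.8.8.8", "8.8.4.4"],          # Google
    ["9.9.9.9", "149.112.112.112"],  # Quad9
]

def resolve_with_fallback(name: str, rtype: str = "A") -> list[str]:
    """Ask each resolver pool in turn and return the first answer that works."""
    for servers in RESOLVER_POOLS:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = servers
        resolver.lifetime = 3.0      # don't hang for minutes when one pool is sick
        try:
            answer = resolver.resolve(name, rtype)
            return [rdata.to_text() for rdata in answer]
        except dns.exception.DNSException:
            continue                 # this pool failed -> try the next provider
    return []

print(resolve_with_fallback("www.microsoft.com"))
```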
5. The human component
In both cases, it was ‘unintentional configuration changes’ or automated processes that got out of control. This shows: Even with the tech giants, the complexity of the systems is so high that mistakes happen. And when they happen, they have global repercussions.
Microsoft users sarcastically commented: ‘If it is not broken, don't fix it!’ – an old adage that Microsoft and co. seem to have forgotten.
What does this mean for you?
Whether you run a business, work as an IT admin, or simply use cloud services: these outages are a wake-up call.
For companies:
- Diversify your cloud infrastructure. Don't bet everything on a single provider.
- Test your emergency plans regularly. If AWS or Azure fails, do you know what to do?
- Communicate proactively with your customers when problems arise. Transparency creates trust.
- Maintain critical functions locally. Not everything has to live in the cloud.
For private users:
- Have backup solutions for important services. If Outlook is down, can you still reach your mail via a local client or webmail?
- Use different platforms for different purposes. All your eggs in one basket is never a good idea.
- Keep local copies of important data. The cloud is convenient, but it is no substitute for local backups.
For policymakers:
The EU is already working on stricter regulations such as the Cyber Resilience Act, the NIS 2 Directive and the Cyber Solidarity Regulation. These laws are designed to ensure that critical infrastructures are better protected. ISACA's Chris Dimitriadis speaks of ‘digital pandemics’ – and that is exactly how these outages feel.
Conclusion: Welcome to the fragile digital world
Two massive cloud outages within nine days show us one thing very clearly: modern digital infrastructure is more fragile than we like to admit. We have become dependent on a handful of tech giants, and when they stumble, we all stumble with them.
The good news? These failures are avoidable – or at least their impact can be minimised. What it takes:
- Technical diversification (Multi-cloud, multi-region, multi-provider)
- Organisational resilience (Emergency plans, reduced operating modes)
- Regulatory framework (Stronger cyber laws)
The question is no longer whether the next big cloud outage is coming, but when. And whether you are one of the winners or losers depends on how well prepared you are.
TL;DR
With that in mind: stay vigilant, stay resilient, and don't forget to check your local backups from time to time. You never know when the next ‘unintentional configuration change’ will come around the corner.