Reflections on the AWS & Azure Outages


Updated Nov 2, 2025 10:15 AM UTC

AWS and Azure, two of the largest cloud providers with market shares of ~30% and ~23% respectively (as of 2025), both experienced large-scale outages recently. These outages affected multiple interdependent services. In this post, I will explore the implications of these outages, what can be done about them, and how fragile the web really is.


AWS Outage - October 19, 2025

The AWS outage on October 19, 2025 began with Amazon's DynamoDB, a serverless, managed NoSQL database service, in the us-east-1 AWS region. As per the official incident report, DynamoDB's DNS management system published an "empty DNS record" for one of its endpoints (dynamodb.us-east-1.amazonaws.com), which prevented any external or internal service from connecting to it. To understand why that matters, you first have to understand what DNS is.

DNS stands for "Domain Name System" and works like a phone book that maps human-readable names to IP addresses (machine addresses). For example, openai.com maps to 172.64.154.211, an IP address of the server/machine the website is served from.

Pro Tip

Run the following command to see it yourself:

ping openai.com

# output
PING openai.com (172.64.154.211): 56 data bytes
64 bytes from 172.64.154.211: icmp_seq=0 ttl=57 time=14.347 ms
64 bytes from 172.64.154.211: icmp_seq=1 ttl=57 time=15.783 ms
64 bytes from 172.64.154.211: icmp_seq=2 ttl=57 time=13.898 ms
64 bytes from 172.64.154.211: icmp_seq=3 ttl=57 time=14.241 ms
64 bytes from 172.64.154.211: icmp_seq=4 ttl=57 time=13.575 ms
64 bytes from 172.64.154.211: icmp_seq=5 ttl=57 time=14.347 ms
64 bytes from 172.64.154.211: icmp_seq=6 ttl=57 time=15.144 ms
^C
--- openai.com ping statistics ---
7 packets transmitted, 7 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 13.575/14.476/15.783/0.696 ms

Press CTRL+C to exit

So, if there's no address listed for a particular endpoint, other machines can't get to it because they don't know where it lives. For all practical purposes, that endpoint is now invisible. A similar thing happened to Meta (Facebook) on October 4, 2021, when a problem with BGP (Border Gateway Protocol) caused Facebook to stop announcing its addresses (locations), which left routers unable to find Facebook's servers.
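The "no address listed" case can also be seen from code. Here is a minimal Python sketch, not anything AWS runs: `resolve` is a hypothetical helper that wraps the standard library's `socket.getaddrinfo` and returns an empty list when a name cannot be resolved. The `lookup` parameter is only there so the function can be exercised without a network.

```python
import socket


def resolve(hostname, lookup=socket.getaddrinfo):
    """Return the sorted, unique IP addresses for hostname.

    An empty list means the name could not be resolved, so a client
    has no address to connect to.
    """
    try:
        # Each result is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the IP address as a string.
        results = lookup(hostname, None)
    except socket.gaierror:
        # The resolver found no usable record for this name.
        return []
    return sorted({result[4][0] for result in results})
```

Calling `resolve("openai.com")` on a machine with working DNS should return one or more addresses. A name with no record returns `[]`, and from the caller's point of view the service simply is not there.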

Coming back to the AWS outage, the DynamoDB DNS resolution issue was just the beginning. It set off a domino effect. EC2 relies on DWFM (Droplet Workflow Manager; an AWS "droplet" is a physical server on which EC2 instances run), and DWFM had been trying to reach DynamoDB for the whole downtime. While it could not, the number of broken leases kept growing (a lease is a logical assignment that maps a specific portion of a physical server's resources, such as CPU and RAM, to an EC2 instance for a certain period). Once the DNS issue was fixed and DynamoDB was back up, DWFM could not keep up with re-establishing all of those broken leases and went into "congestive collapse".

In other words, the system entered a state where it kept doing its job, but each piece of work timed out before it could be completed.

Clients continue to send work, and the system continues to complete that work. Throughput is great. None of the work is useful, though, because clients aren’t waiting for the results, so goodput is zero. The system is mostly stable in this state, and without an external kick, could continue going along that way indefinitely. Up, but down. Working, but broken. - Marc Brooker
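The throughput-versus-goodput distinction in that quote can be shown with a toy simulation. This is a minimal sketch under simplifying assumptions (a single first-in-first-out queue, fixed rates, no retries), not a model of DWFM itself; the function name and all parameters are made up for illustration.

```python
from collections import deque


def simulate(arrivals_per_tick, capacity_per_tick, timeout_ticks, ticks, backlog=0):
    """Run a FIFO server and return (completed, useful).

    completed counts every request the server finished (throughput).
    useful counts only requests finished before the client gave up (goodput).
    """
    # Each queue entry is the tick at which that request arrived.
    queue = deque([0] * backlog)
    completed = useful = 0
    for now in range(ticks):
        queue.extend([now] * arrivals_per_tick)
        for _ in range(min(capacity_per_tick, len(queue))):
            arrived = queue.popleft()
            completed += 1
            if now - arrived <= timeout_ticks:
                useful += 1
    return completed, useful


# Healthy system: everything is served within the timeout.
print(simulate(10, 10, timeout_ticks=2, ticks=50))               # (500, 500)
# Same rates, but starting with a backlog built up during an outage.
print(simulate(10, 10, timeout_ticks=2, ticks=50, backlog=100))  # (500, 30)
```

In the second run the server completes exactly as much work as before, but almost all of it is for requests whose clients have already timed out. Because new work arrives as fast as old work is finished, the backlog never drains by itself, which is the "external kick" problem the quote describes.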

The EC2 instances then impacted the Network Load Balancers (NLB).
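The chain of failures described in this section is a walk over a dependency graph. The sketch below uses a deliberately simplified dependency map based only on the chain described in this post (the service names are illustrative labels, not real AWS identifiers) and computes everything that is affected when one node fails.

```python
from collections import deque

# Simplified, illustrative map: service -> services it depends on.
DEPENDS_ON = {
    "dynamodb": ["dynamodb-dns"],
    "dwfm": ["dynamodb"],
    "ec2": ["dwfm"],
    "nlb": ["ec2"],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on failed."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    impacted = []
    seen = {failed}
    queue = deque([failed])
    while queue:  # breadth-first walk outward from the failed service
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                impacted.append(service)
                queue.append(service)
    return impacted


print(impacted_by("dynamodb-dns", DEPENDS_ON))  # ['dynamodb', 'dwfm', 'ec2', 'nlb']
print(impacted_by("nlb", DEPENDS_ON))           # []
```

A failure at the bottom of the graph reaches everything above it, while a failure at the top affects nothing else. That is why a single DNS record could take down so much.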

RECOMMENDED

Want to get technical? Deep Dive into the AWS Outage with More Than DNS: The 14 hour AWS us-east-1 outage - by Jonathon Belotti

Azure Outage - October 29, 2025

Less than two weeks after the AWS outage, Azure, the second-largest cloud provider, faced an outage of its own on October 29, 2025, which Microsoft says was caused by an "inadvertent tenant configuration change". This change was made to the Azure Front Door (AFD) service, which is basically a CDN (Content Delivery Network). Because AFD is a CDN, both internal and external services depend on it for accessing stored blobs/files, so the problem spread to multiple Microsoft services.

Microsoft has not yet published a full report of exactly how it happened, but from what little we know, there is an automated validation system that runs after such a configuration change is made, and it typically blocks faulty configurations. In this case, it didn't. The validation system itself had a defect, failed to hold back the configuration change, and that is how the change bypassed the checkpoints.
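One general lesson here is that a validation gate should "fail closed": if the validator itself breaks, the change should be rejected, not waved through. The sketch below illustrates that idea only. It is not how Azure Front Door validates configurations, and the config shape (`routes`, `path`, `origin`) and both function names are hypothetical.

```python
def validate_tenant_config(config):
    """Hypothetical validator: return a list of problems found in config."""
    problems = []
    routes = config.get("routes") or []
    if not routes:
        problems.append("no routes defined")
    for route in routes:
        if not route.get("origin"):
            problems.append(f"route {route.get('path', '?')} has no origin")
    return problems


def safe_to_deploy(config, validator=validate_tenant_config):
    """Return (ok, problems). A crashing validator counts as a rejection."""
    try:
        problems = validator(config)
    except Exception as exc:
        # Fail closed: a bug in the validator must not let the change through.
        return False, [f"validator crashed: {exc!r}"]
    return not problems, problems


def buggy_validator(config):
    """Stand-in for a validator with a software defect."""
    raise RuntimeError("unexpected config shape")


good = {"routes": [{"path": "/", "origin": "origin.example"}]}
print(safe_to_deploy(good))                      # (True, [])
print(safe_to_deploy({}))                        # (False, ['no routes defined'])
print(safe_to_deploy(good, buggy_validator)[0])  # False
```

The design choice is in the `except` branch: treating "the check could not run" the same as "the check failed" trades some deployment friction for protection against exactly this kind of bypass.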

Thoughts

These outages remind me how fragile the web really is. We think of these big tech corporations and wonder: what's the worst that can happen to them? I mean, they have all of the resources, power, and money, so how can anything go wrong?

The fact is that no matter how big a tech organization is, it depends on the same basic technologies that make the web the World Wide Web. No matter how reliable they claim to be, the truth is that nothing is reliable enough. All of these complex systems are just abstractions built on top of really basic technologies, and when those basic technologies are affected, disruptions happen.

It all boils down to a mesh of interconnected systems that talk to each other; the only difference is that we don't see any of it. We are so dazzled by these complex software abstractions that we fail to see the very basic foundation they were built upon.

In the end, we are all sitting on top of fibre optics.

Zerodha, a brokerage firm, announced their FLOSS fund in 2024 to fund open source projects. Nithin Kamath, CEO of Zerodha, shared a post on why everyone should support and fund open source projects. He made a really good point: companies that earn billions of dollars while relying on many small open source projects (without which their services would be rendered completely useless) are simply not willing to sponsor or fund them. Now, I know I am going off on a bit of a tangent here, but the point is that if one of the underlying services fails, you start to see a domino effect where all of the abstractions above it fail too.

So, as important as it is to maintain the complex systems, it's equally important to maintain and fund the small, minuscule-looking technologies that are basically the backbone of those software abstractions.
