In the world of cloud infrastructure, "rare" is often another way of saying "eventually." This fall, we were reminded of that—three times.
In October and November 2025, AWS, Azure, and Cloudflare each experienced major disruptions. These outages differed in cause and impact, but they shared something deeper: they revealed how fragile global cloud control planes really are, and how tightly modern systems are coupled behind the scenes.
While the industry has grown more resilient, these incidents show that we still operate in a world where a single stale DNS plan, an incompatible metadata snapshot, or an oversized feature file can bring down services used by millions.
At Bluebricks, we study these events closely. In this blog post, we provide a short breakdown of each outage, its root cause, and how the incident unfolded.
Here's what happened.
Amazon DynamoDB Disruption — October 19–20, 2025
Late on October 19, AWS customers in us-east-1 began seeing elevated DynamoDB error rates and connection failures. At first, it looked like just another availability event. It wasn't.
AWS routes DynamoDB endpoints using a multi-component DNS management system made up of independent DNS Planners and redundant DNS Enactors. These components generate and push DNS configurations that keep endpoint records healthy.
A rare race condition occurred when:
The result: the endpoint lost all IP addresses, effectively removing DynamoDB from DNS.
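One general way to reduce exposure to this class of failure is to make every plan application conditional: reject any plan older than the one already live, and never publish an empty record set. The sketch below is illustrative only and is not AWS's implementation or fix. `DnsPlan`, `EndpointRecord`, and `apply_plan` are hypothetical names, and a real system would need the check and the write to happen atomically (for example with a compare-and-swap on the generation number).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DnsPlan:
    """Hypothetical plan: a generation number plus the IPs for one endpoint."""
    generation: int
    ips: tuple[str, ...]


@dataclass
class EndpointRecord:
    """Hypothetical in-memory stand-in for a live DNS record set."""
    applied_generation: int = -1
    ips: tuple[str, ...] = ()


def apply_plan(record: EndpointRecord, plan: DnsPlan) -> bool:
    """Apply a plan only if it is newer than the live one and non-empty.

    Returns True when the record changed, False when the plan was rejected.
    Assumption: the caller serializes access; this sketch is not thread-safe.
    """
    if plan.generation <= record.applied_generation:
        return False  # stale plan, e.g. from a slow writer: ignore it
    if not plan.ips:
        return False  # never publish an endpoint with zero addresses
    record.applied_generation = plan.generation
    record.ips = plan.ips
    return True
```

With this guard, a late writer holding an old plan becomes a no-op instead of overwriting newer state, and an empty plan can never remove the endpoint from DNS.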
AWS engineers restored DNS by 2:25 AM PDT on October 20, but the real story began after recovery.
DynamoDB underpins several AWS internal systems, including EC2's DropletWorkflow Manager (DWFM), which manages leasing between EC2 control planes and physical hosts.
Because DWFM couldn't read state from DynamoDB:
Azure Front Door Outage — October 29–30, 2025
On October 29, customers using Azure Front Door (AFD) and Azure CDN began experiencing DNS resolution issues, timeouts, and failures across critical Microsoft services — including the Azure portal and Microsoft 365.
A sequence of fully valid customer configuration changes was applied across two different control-plane build versions. Each version generated metadata in a slightly different format.
This inconsistency went unnoticed and triggered:
Engineers had to manually repair the last-known-good (LKG) configuration snapshot, then trigger a phased global redeployment starting at 17:40 UTC, with full recovery just after midnight on October 30.
Azure's architecture relies on strong control-plane gating and health validation. This outage exploited a blind spot: asynchronous failures that occur after synchronous approvals.
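To make that blind spot concrete, here is a minimal sketch of a promotion gate that treats synchronous validation as necessary but not sufficient. A candidate configuration becomes the last-known-good (LKG) snapshot only after it has also survived an asynchronous observation window without failures. This is not Azure's actual mechanism. `Candidate`, `may_promote_to_lkg`, and the `min_healthy_reports` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical config snapshot awaiting promotion to last-known-good."""
    version: str
    passed_sync_validation: bool
    healthy_reports: int = 0  # async health signals received after rollout
    crash_reports: int = 0    # async failure signals received after rollout


def may_promote_to_lkg(c: Candidate, min_healthy_reports: int = 3) -> bool:
    """Promote only after synchronous checks AND a clean asynchronous bake."""
    if not c.passed_sync_validation:
        return False
    if c.crash_reports > 0:
        return False  # any late failure blocks promotion
    return c.healthy_reports >= min_healthy_reports
```

The design choice is that a passing synchronous check only makes a snapshot eligible. Promotion waits for evidence that background processing of the same configuration has also stayed healthy.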
Cloudflare Outage — November 18, 2025
On November 18, the internet had a rough morning. Cloudflare customers worldwide saw 5xx errors, service interruptions, and intermittent failures across sites and APIs.
Despite early speculation, this was not an attack — it was a control-plane configuration issue.
Cloudflare engineers rolled out an improvement to permissions management for a ClickHouse cluster used by the Bot Management pipeline.
The unintended consequence:
The fix required stopping the faulty file generation and manually inserting a known-good configuration. Most impact was resolved by 14:30 UTC.
This outage highlights a recurring pattern in cloud edge networks:
The data plane was stable — the control plane fed it poison.
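One defensive pattern for generated configuration is to validate each new file before it propagates and to fall back to the previous known-good file when validation fails. The sketch below is an illustration under stated assumptions, not Cloudflare's code. `select_feature_file`, the duplicate check, and the `max_features` limit are hypothetical, and the feature file is modeled as a plain list of feature names.

```python
def select_feature_file(
    candidate: list[str],
    known_good: list[str],
    max_features: int,
) -> list[str]:
    """Return the candidate if it passes checks, otherwise the known-good file."""
    if not candidate:
        return known_good  # an empty file is never a valid update
    if len(set(candidate)) != len(candidate):
        return known_good  # duplicate entries suggest a faulty generator
    if len(candidate) > max_features:
        return known_good  # oversized file would exceed the consumer's limit
    return candidate
```

For example, with `max_features=4`, a candidate of `["f1", "f2", "f1", "f2"]` is rejected for duplicates and the known-good file keeps serving, so the data plane degrades to stale configuration rather than failing outright.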
Across AWS, Azure, and Cloudflare, the outages of late 2025 exposed a single unifying truth: modern cloud infrastructure is far more interdependent than many teams realize.
In each case, a small or valid control-plane action cascaded into large-scale failure, whether through a DNS race condition, incompatible metadata poisoning a global snapshot, or a permissions change that doubled the size of a configuration file.
While each provider's systems failed for different reasons, the pattern was identical: fragile control planes feeding critical data planes.
As cloud architectures grow more complex, resilience depends not only on robust data planes, but on safer, more predictable, and more isolated control-plane design.