Three Cloud Outages, One Message: Our Infrastructure Is More Interdependent Than Ever

By Pini Vaknin, Co-founder and CTO ⋅ November 25, 2025

In the world of cloud infrastructure, "rare" is often another way of saying "eventually." This fall, we were reminded of that—three times.

Between October and November 2025, AWS, Azure, and Cloudflare each experienced major disruptions. These outages differed in cause and impact, but they shared something deeper: they revealed how fragile the global cloud control planes really are, and how tightly modern systems are coupled behind the scenes.

While the industry has grown more resilient, these incidents show that we still operate in a world where a single stale DNS plan, an incompatible metadata snapshot, or an oversized feature file can bring down services used by millions.

At Bluebricks, we study these events closely. In this blog post, we provide a short breakdown of each outage, its root cause, and how the incident unfolded.

Here’s what happened.

A DNS Race Condition Turns Into a Regional Control Plane Breakdown

Amazon DynamoDB Disruption — October 19–20, 2025

Late on October 19, AWS customers in us-east-1 began seeing elevated DynamoDB error rates and connection failures. At first, it looked like just another availability event. It wasn’t.

What Actually Happened

AWS manages DNS for DynamoDB’s endpoints with a multi-component system: independent DNS Planners generate DNS plans, and redundant DNS Enactors push them out to keep endpoint records current.

A rare race condition occurred when:

  • One Enactor deployed a new DNS plan to the regional endpoint and then triggered a cleanup process for older plans.
  • A second, delayed Enactor finally applied the older plan it was still carrying, overwriting the newer one.
  • The cleanup logic deleted what it believed was stale state, but that older plan was now the active one.

The result: the endpoint lost all IP addresses, effectively removing DynamoDB from DNS.
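To make the race concrete, here is a deliberately simplified sketch in Python. The data structures and function names are invented for illustration, not taken from AWS’s automation, but they capture the check-then-act hazard described above: applying a plan and cleaning up “stale” plans are separate steps with no shared guard between them.

```python
# Illustrative sketch only -- invented names, not AWS's DNS automation.
# Plans are keyed by generation number; higher means newer.
plans = {5: ["10.0.0.1", "10.0.0.2"],    # an older plan still lying around
         7: ["10.0.1.1", "10.0.1.2"]}    # the newest plan
endpoint = {"active_plan": None, "records": []}

def apply_plan(generation):
    """An Enactor applies whatever plan it was handed -- no staleness check."""
    endpoint["active_plan"] = generation
    endpoint["records"] = plans[generation]

def cleanup_older_than(latest_applied):
    """Cleanup deletes every plan older than the newest plan *this* Enactor applied."""
    for gen in [g for g in plans if g < latest_applied]:
        del plans[gen]
        if endpoint["active_plan"] == gen:
            endpoint["records"] = []      # deleting the plan the endpoint points at

apply_plan(7)           # the healthy Enactor applies the new plan
apply_plan(5)           # the delayed Enactor lands late, overwriting it with the old plan
cleanup_older_than(7)   # cleanup removes generation 5 as "stale" -- but it is live

print(endpoint)  # {'active_plan': 5, 'records': []} -- the endpoint has no IPs left
```

Either of two invariants breaks the race: refuse to apply a plan older than the one currently active, or refuse to delete a plan the endpoint still references. AWS’s actual remediation is more involved, but the sketch shows why a single missing guard was enough.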

AWS engineers restored DNS by 2:25 AM PDT on October 20, but the real story began after recovery.

The Cascading Failure

DynamoDB underpins several AWS internal systems, including EC2’s DropletWorkflow Manager (DWFM), which maintains the leases between the EC2 control plane and the physical servers (“droplets”) that host EC2 instances.
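The exact lease protocol isn’t public, so treat the following as a rough mental model with invented names rather than a description of DWFM internals: each host holds a time-bounded lease that the workflow manager keeps alive by writing to a state store, and if the store is unreachable the lease simply runs out.

```python
import time

LEASE_TTL_SECONDS = 60   # purely illustrative; real lease timing is not public

class LeaseTracker:
    """Toy lease model: renewals are writes to a state store (DynamoDB, for DWFM)."""

    def __init__(self, state_store):
        self.state_store = state_store     # any object with a put(key, value) method
        self.lease_expiry = {}             # host_id -> unix timestamp

    def renew(self, host_id):
        # If the state store is unreachable this raises, the renewal never lands,
        # and the lease quietly counts down toward expiry.
        self.state_store.put(f"lease/{host_id}", {"renewed_at": time.time()})
        self.lease_expiry[host_id] = time.time() + LEASE_TTL_SECONDS

    def expired_hosts(self):
        now = time.time()
        return [h for h, exp in self.lease_expiry.items() if exp < now]
```

After a couple of hours of failed renewals, essentially every host in the region shows up in expired_hosts() at the same moment, and that backlog is what the list below is about.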

Because DWFM couldn’t read state from DynamoDB:

  • It lost its leases with the underlying droplets.
  • When DynamoDB returned, DWFM tried to rehydrate a massive backlog at once and entered congestive collapse (sketched after this list).
  • EC2 launches began failing with “insufficient capacity” and “request limit exceeded” errors.
  • The Network Manager struggled to propagate network configurations.
  • Network Load Balancers (NLBs) saw alternating health-check failures, expanding the blast radius.
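The congestive-collapse step is worth pausing on. The sketch below is not how EC2 actually recovers; it just contrasts, with a made-up reestablish_lease call, throwing the entire expired-lease backlog at a freshly recovered dependency versus draining it with bounded concurrency.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def reestablish_lease(host_id):
    """Stand-in for the real work: contact the host and write fresh state to the store."""
    time.sleep(0.01)   # pretend each re-establishment costs something

def rehydrate_unbounded(expired_hosts):
    # Roughly what congestive collapse looks like: the whole backlog hits the
    # recovering dependency at once, timeouts trigger retries, retries add more
    # load, and useful throughput drops exactly when it is needed most.
    with ThreadPoolExecutor(max_workers=max(len(expired_hosts), 1)) as pool:
        list(pool.map(reestablish_lease, expired_hosts))

def rehydrate_bounded(expired_hosts, max_in_flight=32):
    # Bounded concurrency: the dependency sees a predictable load and the
    # backlog drains steadily instead of arriving as one giant wave.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        list(pool.map(reestablish_lease, expired_hosts))
```

The general lesson: recovery paths need the same admission control as steady-state paths, because the backlog right after an outage is often the largest burst of work the system will ever see.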

Incompatible Metadata Silently Corrupts the Global Last Known Good

Azure Front Door Outage — October 29–30, 2025

On October 29, customers using Azure Front Door (AFD) and Azure CDN began experiencing DNS resolution issues, timeouts, and failures across critical Microsoft services — including the Azure portal and Microsoft 365.

What Actually Happened

A sequence of fully valid customer configuration changes was applied across two different control-plane build versions. Each version generated metadata in a slightly different format.

This inconsistency went unnoticed and triggered:

  • A latent bug in the AFD data plane.
  • An asynchronous crash ~5 minutes after deployment.
  • A misleading period of healthy signals during that five-minute window.
  • Propagation of the corrupted config to the global edge network.
  • A polluted Last Known Good (LKG) snapshot, making rollback impossible.

Engineers had to manually repair the LKG, then trigger a phased global redeployment starting at 17:40 UTC on October 29, with full recovery just after midnight UTC on October 30.
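The root of that chain is the very first step: two builds emitting metadata in slightly different formats, with nothing at the data-plane boundary checking which format it was handed. A minimal sketch of such a check, using invented field names rather than anything from Azure’s pipeline:

```python
SUPPORTED_SCHEMAS = {"2025-09"}   # schema versions this data-plane build can parse

def load_config(metadata: dict) -> dict:
    """Refuse metadata produced for a schema this build does not understand.

    Rejecting at load time turns a latent crash five minutes after deployment
    into an immediate, attributable validation error in the control plane.
    """
    schema = metadata.get("schema_version")
    if schema not in SUPPORTED_SCHEMAS:
        raise ValueError(
            f"unsupported metadata schema {schema!r}; "
            f"this build supports {sorted(SUPPORTED_SCHEMAS)}"
        )
    return metadata["config"]   # parse the rest only after the version check passes
```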

Why This Matters

Azure’s architecture relies on strong control-plane gating and health validation. This outage exploited a blind spot: asynchronous failures that occur after synchronous approvals.
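One way to close that blind spot is to make “healthy” a claim about time rather than a point-in-time check: a configuration only becomes the Last Known Good after it has served traffic, crash-free, for longer than the delay at which failures like this one surface. The sketch below uses invented names, and Microsoft’s actual safeguards and follow-ups are more involved; it is only meant to show the shape of the idea.

```python
import time

SOAK_SECONDS = 15 * 60   # must comfortably exceed the ~5-minute delayed crash seen here

class LastKnownGood:
    """Promote a config to LKG only after it has soaked in production without incident."""

    def __init__(self):
        self.lkg = None            # last config that survived a full soak window
        self.candidate = None      # currently deployed config, not yet trusted
        self.deployed_at = None

    def deploy(self, config):
        self.candidate = config
        self.deployed_at = time.time()

    def record_crash(self):
        # A data-plane crash disqualifies the candidate. The LKG is untouched,
        # so a rollback still has something trustworthy to land on.
        self.candidate = None

    def tick(self):
        # Called periodically. Promotion never happens at deployment time, so a
        # config that looks healthy for five minutes and then detonates can
        # never pollute the snapshot we would later roll back to.
        if self.candidate is not None and time.time() - self.deployed_at >= SOAK_SECONDS:
            self.lkg = self.candidate
```

The trade-off is a slower-moving LKG, which is exactly the point: the snapshot you roll back to should lag behind the config you are still deciding whether to trust.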

A Small Database Change Brings Down a Global Edge Network

Cloudflare Outage — November 18, 2025

On November 18, the internet had a rough morning. Cloudflare customers worldwide saw 5xx errors, service interruptions, and intermittent failures across sites and APIs.

Despite early speculation, this was not an attack — it was a control-plane configuration issue.

What Actually Happened

Cloudflare engineers rolled out an improvement to permissions management for a ClickHouse cluster used by the Bot Management pipeline.

The unintended consequence:

  • A query began returning duplicate rows, doubling the size of the generated feature-configuration file.
  • When this file propagated to Cloudflare’s proxy systems (FL/FL2), it exceeded a hard-coded limit of 200 features.
  • The proxies crashed.
  • The query kept running every five minutes, and because the ClickHouse cluster was being updated gradually, it produced a bad file on some runs and a good one on others.
  • Systems oscillated between recovery and collapse as each new file propagated.

The fix required stopping the faulty file generation and manually inserting a known-good configuration. Most impact was resolved by 14:30 UTC.
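Two inexpensive guards would have contained most of this, and both fit in a few lines. The sketch below uses invented names and a made-up row shape, and Cloudflare’s own follow-ups are broader, but the idea is simple: deduplicate the generated feature set before writing the file, and treat a file that violates the hard limit as a reason to keep serving the previous good one rather than a reason to crash.

```python
MAX_FEATURES = 200   # the proxy's hard limit described above

def build_feature_file(rows, previous_good):
    """Build the feature list defensively (illustrative sketch, not Cloudflare's pipeline).

    `rows` is whatever the metadata query returned; `previous_good` is the last
    file that the fleet successfully loaded.
    """
    features = sorted({row["feature_name"] for row in rows})   # duplicates collapse here
    if len(features) > MAX_FEATURES:
        # An oversized file means generation went wrong. That is a reason to
        # alert and keep serving what already works, not a reason to ship it
        # to every proxy and let them crash on it.
        alert(f"feature file has {len(features)} features; limit is {MAX_FEATURES}")
        return previous_good
    return features

def alert(message):
    print(f"[feature-pipeline] {message}")   # stand-in for real paging/alerting
```

The hard limit itself was not the bug. Hitting it, and turning that into a fleet-wide proxy crash instead of a rejected config, was.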

Why This Matters

This outage highlights a recurring pattern in cloud edge networks:

The data plane was stable — the control plane fed it poison.

Summary

Across AWS, Azure, and Cloudflare, the outages of late 2025 exposed a single unifying truth: modern cloud infrastructure is far more interdependent than many teams realize.

In each case, a small and seemingly valid control-plane action cascaded into large-scale failure, whether through a DNS race condition, incompatible metadata poisoning a global snapshot, or a query’s duplicate rows doubling a configuration file.

While each provider’s systems failed for different reasons, the pattern was identical: fragile control planes feeding critical data planes.

As cloud architectures grow more complex, resilience depends not only on robust data planes, but on safer, more predictable, and more isolated control-plane design.
