When the status page failed too — Cloudflare, AWS, Azure 2025

A timeline analysis of the 2025 outages where the vendor's own status page went down alongside the production stack. With the relevant Cloudflare admission quote about coincidental dependencies.

By Akshay Sarode · December 8, 2025 · 11 min read

Tags: status-page, postmortem, uptime

The 150-word answer

In 2025, the three biggest infrastructure providers — AWS, Cloudflare, and Microsoft Azure — each had a major outage where the status page meant to inform customers about the outage was also affected by the outage.

For AWS on 20 October, the Service Health Dashboard had a us-east-1 dependency. For Cloudflare on 18 November, a coincident third-party hosting failure took the status page down at the same time. For Azure across multiple events, the customer-facing status page diverged from the in-tenant Service Health view.

The pattern is consistent: status pages are designed with the intent to be independent, but rarely with verified independence. The lesson is architectural — DNS, CDN, certificate authority, BGP transit, and hosting must be disjoint from the production stack, and the disjointness must be tested with regular drills.

Why "off our infrastructure" isn't enough

Three of the most sophisticated infrastructure teams in the world told themselves in 2025 that their status pages were independent. All three were wrong, in different ways. That's the headline.

This piece is the chronology — for each of the three, what was claimed about status-page independence, what actually happened, and what the architectural lesson is.

Cloudflare — 18 November 2025

The shortest, clearest case. From the official Cloudflare postmortem:

"Cloudflare's status page went down. The status page is hosted completely off Cloudflare's infrastructure with no dependencies on Cloudflare. While it turned out to be a coincidence, the status page going down at the same time as the network was experiencing issues led some to speculate that Cloudflare was under a large-scale attack."

Read the sentence carefully. Cloudflare states two facts:

  1. The status page is hosted off Cloudflare infrastructure.
  2. The status page has no dependencies on Cloudflare.

Then it states a third fact: it went down anyway, at the same time, coincidentally.

This is the architectural lesson. "No dependencies on Cloudflare" is necessary but not sufficient. The status page provider — whoever Cloudflare uses — had its own incident at the same time. The intent of independence was fulfilled. The outcome of independence wasn't.

What does sufficient look like? Multi-vendor failover at the status-page layer. Two providers. If one is down, the other serves. Plus a static snapshot in a third location. Cloudflare's postmortem doesn't go into how they'll fix this, but the answer must be more than "find a more reliable single vendor." Single-vendor reliability is, mathematically, the wrong abstraction.
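The failover logic itself is simple. Here is a minimal Python sketch of "two serving paths plus a static snapshot"; the hostnames are hypothetical placeholders, not Cloudflare's actual architecture, and the probe is injected so the logic can be exercised without network access:

```python
import urllib.error
import urllib.request

def first_healthy(paths, probe):
    """Return the first serving path the probe reports healthy, else None."""
    for url in paths:
        if probe(url):
            return url
    return None  # every path failed -- page an operator

def http_probe(url, timeout=3):
    """One possible probe: healthy means an HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Hypothetical serving paths, in priority order.
SERVING_PATHS = [
    "https://status.example.com",           # primary status-page vendor
    "https://status-backup.example.net",    # second, independent vendor
    "https://status-snapshot.example.org",  # static snapshot, third location
]
```

A real deployment would push this decision down to DNS or a global load balancer rather than client code, but the invariant is the same: no single vendor sits on every path to the page.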

AWS us-east-1 — 20 October 2025

The longest, most embarrassing case. AWS's Service Health Dashboard (SHD) had a known historical dependency on us-east-1, the region that went down, and multiple outlets covered this at the time.

The timeline, reconstructed from public sources:

Time (UTC) | Event
~07:00 | DNS resolution issues begin in us-east-1
~07:11 | Downdetector spike for AWS-dependent services (Snapchat, Reddit, Robinhood)
~07:15 | Mass customer reports across HN, Twitter, support channels
~08:31 | AWS Service Health Dashboard posts first acknowledgement
~08:31 | Estimated ~80 minutes from impact start to dashboard ack

The 80-minute number is approximate (different observers measured slightly different start times), but it's directionally correct and it's the number that matters. For 80 minutes, the canonical "is AWS down?" surface said no, when the rest of the internet was clearly saying yes.

Why? Because the SHD itself was running on the affected region. The dashboard couldn't update because the infrastructure that the dashboard ran on was the infrastructure being reported on. AWS has been working to remove this dependency for years. As of October 2025, the work wasn't done.

This is the most basic architectural failure imaginable, in the most architecturally sophisticated company on earth, on the highest-profile region they operate. If it happens to AWS, it happens to anyone.

Azure — 6 September 2025

The most confusing case. Azure has two status surfaces:

  1. The public Azure Status page, visible at status.azure.com
  2. The in-tenant Service Health view, visible only when logged into the Azure Portal

These two surfaces are supposed to agree. In the 6 September incident, they diverged. The public page reported one set of affected services and regions; the in-tenant view reported a different (broader) set. Customers saw their own tenant's services flagged red while the public page still said most things were green.

The mechanism: the public page is a curated subset, designed for "broad customer impact" thresholds. The in-tenant view is per-tenant and shows you anything affecting your subscription. When an incident affects "many tenants but a small percentage of total tenants," the public page can plausibly stay green while many individual customers are red.

This is a different failure mode than AWS or Cloudflare. The infrastructure behind Azure Status worked fine; the editorial threshold for what gets posted publicly created the gap. It echoes the OneUptime line: "every minute unreported is a minute that doesn't count against your SLA." If the threshold for posting is "broad impact," and "broad" is defined liberally, the page stays green during real customer incidents. The architecture is fine; the policy lies.
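To make that quote concrete, here is the arithmetic as a small Python sketch. The SLA percentage and minute counts are illustrative numbers, not Azure's actual contract terms:

```python
def downtime_budget_minutes(sla_pct, days=30):
    """Allowed downtime per billing period at a given SLA percentage."""
    return days * 24 * 60 * (1 - sla_pct / 100)

def counted_downtime(actual_minutes, unreported_minutes):
    """Downtime that 'counts' when only acknowledged minutes earn credits.

    Every unreported minute is subtracted from what the customer can claim.
    """
    return max(actual_minutes - unreported_minutes, 0)
```

At 99.9% over 30 days the budget is roughly 43 minutes, so an 80-minute incident blows through it; but if the status page never acknowledges the incident, the counted downtime is zero and no credit is owed.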

The pattern, in three lines

  • AWS: status page had infrastructure dependency on the region being reported on
  • Cloudflare: status page had a coincident third-party dependency, despite being "off our infra"
  • Azure: status page had editorial thresholds that lagged real customer impact

Different failures. Same outcome: customers couldn't trust the canonical answer for "is it down?"

What "verified independent" looks like

Three commitments, in order:

1. Disjoint stack

Every layer of the status page (DNS, CDN, certificate, hosting, state store, email) must be on a different vendor than the production stack. Not "different account at the same vendor" — different vendor. AWS using AWS for SHD is the canonical anti-pattern.
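Disjointness is checkable, not just assertable. A sketch of the check, with hypothetical vendor names standing in for real providers:

```python
# Illustrative vendor maps -- placeholder names, not a statement about
# any provider's real stack.
PRODUCTION = {
    "dns": "vendor-a", "cdn": "vendor-b", "cert": "vendor-c",
    "hosting": "vendor-a", "state": "vendor-d", "email": "vendor-e",
}
STATUS_PAGE = {
    "dns": "vendor-f", "cdn": "vendor-g", "cert": "vendor-h",
    "hosting": "vendor-a",  # overlap: the anti-pattern this check catches
    "state": "vendor-i", "email": "vendor-j",
}

def shared_vendors(prod, status):
    """Layers where the status page shares a vendor with production."""
    return {layer: prod[layer]
            for layer in prod
            if status.get(layer) == prod[layer]}
```

Run it in CI against your infrastructure inventory and fail the build when the result is non-empty.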

2. Multi-provider failover at the status-page layer itself

Even within the disjoint stack, single-vendor failures happen. Cloudflare's case is the proof. The status page should have at least two independent serving paths, with automated failover. Plus a static snapshot in a third location.

3. Tested failover drills

Monthly. On a public schedule. Disconnect the primary status-page renderer, confirm the fallback serves correct data, restore. Log the drill on the public page itself for transparency.
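The drill steps above can be sketched as a small harness. The callables are injected placeholders (how you actually disconnect the renderer depends on your stack), and the restore step runs even when the probe fails:

```python
import datetime

def run_drill(disable_primary, probe_fallback, restore_primary, log):
    """Disconnect the primary renderer, verify the fallback serves,
    restore, and record the result for the public drill log."""
    disable_primary()
    try:
        ok = probe_fallback()
    finally:
        restore_primary()  # always restore, even if the probe raised
    log({
        "drill": "status-page-failover",
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "fallback_served": ok,
    })
    return ok
```

Publishing each log entry on the status page itself is what turns "we claim independence" into "we demonstrated it on this date."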

Sutrace does all three. We documented the architecture at /use-cases/honest-status-page and the broader argument in "Why most status pages lie."

The procurement test

If you're evaluating a status-page tool, ask:

"What are the upstream vendors for your status-page DNS, CDN, certificate authority, and hosting? And how do those compare to the upstream vendors for your monitoring/uptime product?"

A vendor that answers this in 30 seconds with concrete vendor names — and where the two lists are disjoint — is one to take seriously.

A vendor that can't answer, or whose two lists overlap on more than one vendor, has an unverified claim of independence. They're Cloudflare in November 2025 waiting for a coincident failure.

What the 2025 chronology should change

Three things:

  1. Postmortem culture should treat status-page failure as a P0 finding. Not a footnote. The status page failing during the outage is a separate, equally important, architectural finding.
  2. SLA contracts should specify status-page availability separately. Contractually, customers should be able to claim credits for status-page downtime during a primary outage, because it materially worsens the impact.
  3. Vendor selection should weight status-page independence verification. Not the marketing claim. The verified, drilled, recently-tested architecture.

We published the chronology because every team running a public status page should read these three postmortems, then look hard at their own setup. Most setups will not pass.

Try Sutrace free at sutrace.io. Status page included at every tier, hosted off our primary infrastructure, with monthly failover drills.