Why most status pages lie — the evidence

A pillar piece tracing 2025's biggest cloud outages and the gap between what the status page said and what was actually happening. With timestamps, dashboards, and the case for auto-driven status.

By Akshay Sarode · January 14, 2026 · 14 min read · Tags: status-page, uptime, synthetic, postmortem

The 150-word answer

In November 2025, someone posted a question to Hacker News that got 600+ upvotes in a few hours: "Why are most status pages delayed?" The thread is the most concise summary of the problem you can read. The answers are unanimous: because status pages are PR pages, not engineering signals.

In 2025, this stopped being a theoretical problem. Cloudflare went down on 18 November and their own status page went down with them. AWS us-east-1 went down on 20 October and the Service Health Dashboard took ~80 minutes to acknowledge customer impact. Slack on 26 February. GCP on 12 June. OpenAI on 26 November. Azure on 6 September. Each of these had a measurable gap between observed customer impact and status-page acknowledgement.

This piece is the evidence file. If you build or buy status pages, read this once. Then fix yours.

Why this matters

The reason a status page exists is to be the canonical answer to "is the thing broken?" so that:

  • Customer support doesn't drown in tickets that all say the same thing
  • Customers can route around your outage (failover, defer launches, etc.)
  • Your SLA math has a public, auditable surface
  • Your incident response team has one fewer thing to update during the worst hour of the year

When the status page lags reality, every one of those breaks. Tickets pile up. Customers retry, amplifying load on your already-dead infrastructure. SLA disputes happen in private. Your incident commander gets pinged by execs asking "why is the page still green?"

OneUptime put it bluntly: "Every minute unreported is a minute that doesn't count against your SLA." That's the perverse incentive. The product team wants the page to lag. The SRE team wants the page to lead. Without architectural enforcement, the product team always wins.

The 2025 outage chronology

Below is a chronological list of major 2025 outages where the status page lagged or failed outright alongside the outage, summarized from public postmortems and tracked at IncidentHub.

Slack — 26 February 2025

A backend infrastructure issue caused widespread connection failures across Slack workspaces globally. Users reported being unable to send messages, load channels, or join huddles for roughly 90 minutes. Slack's status page initially reported "investigating elevated error rates" — minimizing language for what was, in practice, a hard outage for many customers.

The gap between mass user reports (Downdetector lit up at 10:42 UTC) and Slack's first status acknowledgement (~10:54 UTC) was twelve minutes. That's actually one of the better numbers on this list.

GCP — 12 June 2025

A Google Cloud control-plane issue affected IAM and billing across multiple services. Cloud Run deployments failed, BigQuery jobs hung, and Vertex AI inference returned 500s. The Google Cloud Service Health page took roughly 35 minutes to update, and when it did, the affected-services list was incomplete for several more hours.

Azure — 6 September 2025

A network configuration change caused widespread availability issues for Azure customers. Reporting was inconsistent between the public Microsoft Azure Status page and the per-tenant Azure Service Health view, leading to widespread confusion about scope.

AWS us-east-1 — 20 October 2025

The big one of 2025. A DNS-related cascading failure in us-east-1 affected dozens of AWS services and an enormous number of downstream customers (Snapchat, Reddit, Robinhood, Zoom, Disney+ all reported issues). Multiple public postmortems exist.

The most damning detail: AWS's own Service Health Dashboard took ~80 minutes between the start of measurable customer impact and the first dashboard acknowledgement. That's 80 minutes during which thousands of engineering teams worldwide were debugging "is it us or is it AWS?" with no canonical answer. The dashboard had a dependency on us-east-1, the region that was down. The company that wrote the book on availability shipped a status page that had a single-region SPOF.

Cloudflare — 18 November 2025

Cloudflare's edge had a global incident lasting roughly 2.5 hours. From the official postmortem:

"Cloudflare's status page went down. The status page is hosted completely off Cloudflare's infrastructure with no dependencies on Cloudflare. While it turned out to be a coincidence…"

Read that carefully. The status page really was hosted off Cloudflare's infrastructure, and it still went down at the same time, because the third-party hosting it relied on failed during the same window. Hosting the page elsewhere is not the same as guaranteeing it stays up when you need it most. The architectural intent was correct; the verification wasn't.

This is the case study every status-page architect should memorize. Intent isn't enough. Test the failover.

OpenAI — 26 November 2025

ChatGPT and the OpenAI API went hard down for roughly 90 minutes. OpenAI's status page lagged by about 25 minutes before the first incident was posted, and the severity language ("investigating") undersold the impact for another 40 minutes after that.

For a product where developers are billed per token and re-run failed requests, those minutes are real money.

The pattern

Across six major 2025 outages, the median delay between measurable customer impact and first status-page acknowledgement was roughly 25 minutes. The longest was AWS at 80 minutes. The shortest was Slack at 12 minutes.

In every case, the underlying technical fact — the API was returning 5xx, the region was unreachable, the auth service was timing out — was visible to external synthetic-monitoring tools (Downdetector, Pingdom, Datadog, ThousandEyes, our own tools) immediately, in some cases within a single 60-second polling window.

The data was there. The status page didn't reflect it.
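
To be concrete about what "the data was there" means: a synthetic check is nothing more exotic than a scheduled HTTP request and a pass/fail rule. Here is a minimal sketch in Python; the endpoint, timeout, and 60-second interval are illustrative placeholders, not anyone's real configuration.

```python
# One synthetic check: fetch an endpoint, classify the result, repeat every
# polling window. URL and thresholds are hypothetical.
import time
import urllib.error
import urllib.request

CHECK_URL = "https://api.example.com/healthz"  # hypothetical endpoint
INTERVAL_SECONDS = 60                          # one polling window

def probe(url: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Return (healthy, detail) for a single synthetic request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:   # 4xx/5xx raised as exceptions
        return exc.code < 500, f"HTTP {exc.code}"
    except Exception as exc:                 # DNS failure, timeout, reset
        return False, f"{type(exc).__name__}: {exc}"

if __name__ == "__main__":
    while True:
        healthy, detail = probe(CHECK_URL)
        stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        print(f"{stamp} {'OK' if healthy else 'FAIL'} {detail}")
        time.sleep(INTERVAL_SECONDS)
```

A fleet of checks like this, run from a few regions, is all the signal a status page needs in order to stop lying.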

Why does this happen?

Three reasons, in order of how often we see them:

1. The status page is a manual product

Most status pages — Atlassian Statuspage, Hyperping in default mode, in-house solutions — are CMS products. Someone has to log in and write a post. During an outage, that someone is also fighting the fire, fielding exec pings, and writing the postmortem timeline. Updating the public page is priority twelve.

2. The status page has a dependency on the thing being reported on

AWS us-east-1's dashboard ran on us-east-1. Cloudflare's status page was hosted off Cloudflare, yet went down in the same window anyway. This is the architectural failure mode: status pages should have zero overlap with the production stack they report on, and that independence has to be verified rather than assumed. In practice, status pages often share DNS resolvers, CDN providers, certificate authorities, or upstream BGP transit with the stack they cover, and any one of those can take both down at once.

3. There's an incentive to delay

The PR team and the SRE team have opposing incentives during an incident. PR wants minimal language until the engineering picture is clear. SRE wants the public state to match observed reality so that customer support and partner teams can stop being asked "is it down?" Without architectural enforcement (auto-drive that the PR team can't easily disable), PR wins because PR is the one who logs in.

OneUptime's line bears repeating: "Every minute unreported is a minute that doesn't count against your SLA." Where an SLA carries contractual credits, the posted incident duration is what counts toward them. So the longer your status page can plausibly say "we're investigating," the lower your SLA payouts. The incentive is broken by design, which is why the fix has to be architectural.
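
To put numbers on the incentive, here is a back-of-the-envelope calculation. The outage durations and the credit schedule are invented for illustration; substitute the tiers from your own contract.

```python
# Hypothetical SLA credit math: the credit owed for the outage customers
# actually experienced vs. the outage the status page admitted to.
# The credit tiers are invented; real contracts differ.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def uptime_pct(downtime_minutes: float) -> float:
    return 100.0 * (1 - downtime_minutes / MINUTES_PER_MONTH)

def credit_pct(uptime: float) -> float:
    """Toy credit schedule: 10% credit below 99.9% uptime, 25% below 99.0%."""
    if uptime < 99.0:
        return 25.0
    if uptime < 99.9:
        return 10.0
    return 0.0

actual_outage_min = 90   # what external monitoring saw
posted_outage_min = 10   # what the status page admitted to

for label, minutes in [("actual", actual_outage_min), ("posted", posted_outage_min)]:
    up = uptime_pct(minutes)
    print(f"{label}: {minutes} min down -> {up:.3f}% uptime -> {credit_pct(up):.0f}% credit")

# actual: 90 min down -> 99.792% uptime -> 10% credit
# posted: 10 min down -> 99.977% uptime -> 0% credit
```

Eighty unreported minutes can be the difference between paying a credit and paying nothing.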

What a non-lying status page looks like

Three architectural commitments:

1. Auto-driven by default. Component state is bound to monitored signals (synthetic checks, SLO burn rate, error budget). When the signal crosses threshold, the component flips state. No human in the loop for the binary green/yellow/red transition. PR can add narrative on top, but cannot move the binary backwards without an explicit, time-limited override. A minimal sketch of this binding follows the list.

2. Hosted off the primary stack. Different cloud, different DNS resolver, different CDN, different certificate authority where possible. Tested monthly with a deliberate failover drill. Cloudflare's November 2025 incident is the case study for what happens when you say "off our infrastructure" but don't verify.

3. Static fallback. A periodic snapshot of current state, served from object storage on a third stack. So even if both the primary infra and the status-page renderer are down, the page can serve last-known-correct data — frozen in time but truthful. A sketch of the snapshot appears below.
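
As promised, a minimal sketch of the first commitment: component state as a pure function of recent check results, with a human override that expires instead of sticking. The class name, thresholds, and override window are illustrative assumptions, not Sutrace's actual implementation.

```python
# Sketch of auto-drive: state is derived from the recent check window, and a
# manual override can only hold for a bounded time. All names and thresholds
# here are hypothetical.
import time
from collections import deque

DEGRADED_ERROR_RATE = 0.10   # >=10% failed checks in the window -> degraded
DOWN_ERROR_RATE = 0.50       # >=50% failed checks in the window -> down
WINDOW = 5                   # number of recent checks considered
OVERRIDE_TTL_SECONDS = 900   # manual overrides expire after 15 minutes

class Component:
    def __init__(self, name: str) -> None:
        self.name = name
        self.recent = deque(maxlen=WINDOW)  # True = check passed
        self.override = None                # (state, expires_at) or None

    def record_check(self, passed: bool) -> str:
        self.recent.append(passed)
        return self.state()

    def set_override(self, state: str) -> None:
        # PR can hold a state, but only for a bounded window; the signal
        # takes over again when the override expires.
        self.override = (state, time.time() + OVERRIDE_TTL_SECONDS)

    def state(self) -> str:
        if self.override and time.time() < self.override[1]:
            return self.override[0]
        if not self.recent:
            return "unknown"
        error_rate = 1 - sum(self.recent) / len(self.recent)
        if error_rate >= DOWN_ERROR_RATE:
            return "down"
        if error_rate >= DEGRADED_ERROR_RATE:
            return "degraded"
        return "operational"
```

The important property isn't the exact thresholds; it's that nobody has to log in for the page to turn red.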

We document the Sutrace implementation of all three in the honest status page use-case.
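
For the third commitment, the snapshot itself is deliberately boring: last-known component states serialized to flat JSON that a dumb host can serve. A sketch, with the upload step described in a comment rather than tied to any particular object-storage SDK:

```python
# Sketch of the static-fallback snapshot: a flat JSON document that object
# storage on a separate provider can serve even when the renderer is down.
import json
import time

def write_snapshot(states: dict[str, str], path: str = "status-snapshot.json") -> None:
    """Write last-known component states plus a timestamp to a flat file."""
    snapshot = {
        "generated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "components": states,
    }
    with open(path, "w") as fh:
        json.dump(snapshot, fh, indent=2)

# Example: snapshot whatever the auto-drive loop currently believes.
write_snapshot({"api": "operational", "dashboard": "degraded"})

# In a real setup this file would be pushed on a schedule to object storage on
# a third stack (different cloud, DNS, CDN), and the status page would fall
# back to serving it when the renderer is unreachable.
```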

The procurement question

If you're evaluating a status page tool today, ask the vendor:

  1. Is your status page hosted on infrastructure that has zero dependencies on the production stack? Get a list of vendors. Check for overlap.
  2. What's the median time from monitor-detected failure to public-page state change, for components in auto-drive mode? A vendor that can't answer this in seconds doesn't have auto-drive.
  3. How often do you test the status-page-while-primary-down failover? "We don't" is a real answer; it tells you everything.
  4. What's the SLA on the status page itself? And: is that SLA monitored on the status page? (Trick question. The answer should be "yes, externally.")

If the vendor's answer is "we use Atlassian Statuspage internally," ask them what they did between 2 February 2026 and 23 February 2026 when Atlassian's System Metrics feature was broken for 21 days.

What we built

Sutrace's status page is auto-driven by default, hosted off our primary infrastructure (Cloudflare Pages + Bunny.net failover, separate DNS resolver), and falls back to a static snapshot served from object storage when the renderer can't reach Firestore. We test the failover monthly. The architecture is documented at /use-cases/honest-status-page.

We also wrote a separate analysis of the 2025 chronology where vendors' own pages went down, and a piece on the Atlassian 21-day outage and its lessons.

What you should do this week

If you have a status page today, do three things:

  1. List its dependencies. DNS, CDN, certificate authority, hosting, email sender. Check for overlap with your production stack.
  2. Measure the lag. Pick your last incident. Find the timestamp of the first customer report and the timestamp of the first status-page update, then subtract (see the sketch after this list). If the gap is more than 10 minutes, you have a status page that lies. Most do.
  3. Test the failover. Pretend your primary infra is down. Can the status page still serve? If you don't know, you don't have one.
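
Step 2 is plain timestamp arithmetic. A sketch, using the Slack timestamps cited earlier as the worked example:

```python
# Measure status-page lag for one past incident: first external signal of
# customer impact vs. first public status-page update.
from datetime import datetime, timezone

first_customer_report = datetime(2025, 2, 26, 10, 42, tzinfo=timezone.utc)  # Downdetector spike
first_status_update = datetime(2025, 2, 26, 10, 54, tzinfo=timezone.utc)    # first public acknowledgement

lag_minutes = (first_status_update - first_customer_report).total_seconds() / 60
print(f"status-page lag: {lag_minutes:.0f} minutes")  # 12 minutes
```

Run the same subtraction for your own last incident and keep the number visible; it's the metric this whole piece is about.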

If you're starting from scratch, evaluate Sutrace, Better Stack, Hyperping, and OneUptime. All four are honest about the architecture. All four can be configured for auto-drive. Pick the one whose pricing and feature surface fits.

The status page is a real engineering surface, not a marketing surface. Treat it like one.


Try Sutrace free at sutrace.io. Or read the deeper architecture piece at /use-cases/honest-status-page.