

Alert fatigue is cognitive fragmentation, and it's the top-3 SRE concern in 2026

70% of SREs report alert fatigue as a top-3 concern. The cause isn't volume — it's the cognitive fragmentation of repeated low-grade interruptions. A walk through Google's urgent / actionable / imminent rule and what tuned-by-default alerting looks like.

By Akshay Sarode · March 17, 2026 · 14 min read · alerting, on-call, sre, observability


TL;DR. The 2024 Catchpoint SRE Report — re-cited in OneUptime's March 2026 piece on alert fatigue + AI on-call — finds 70% of SRE respondents naming alert fatigue as a top-3 concern. The pattern in the last decade of incident research is clear: alert fatigue is not a volume problem. It is cognitive fragmentation — the slow erosion of the on-call engineer's attention by repeated low-grade interruptions that don't carry useful information. PagerDuty's canonical research puts the average on-caller at 50 alerts per week with 2–5% being actionable. Google's SRE book gives us the rule that fixes most of it: every alert must be urgent, actionable, and imminent. This post walks through why alert fatigue happens, why most observability vendors ship "alert on everything" defaults, and what tuned-by-default alerting actually looks like — with a concrete checklist.

If you're on-call this week and your phone has buzzed for things that didn't need a human, this is the post.

The numbers, with sources

Three datapoints to anchor the conversation:

1. The Catchpoint 2024 SRE Report. 70% of respondents place alert fatigue in their top-3 concerns. Re-cited and analysed in OneUptime's March 2026 piece.

2. PagerDuty's canonical research. The PagerDuty alert-fatigue learn page summarises a multi-year body of work. The headline numbers most teams know: the average on-call engineer receives ~50 alerts per week, with only 2–5% being actionable. The remaining 95–98% are noise — informational alerts, flapping checks, alerts that resolved before the engineer could even respond.

3. The Atlassian incident-management research. Atlassian's alert-fatigue page names the same dynamic — alert volume up, signal-to-noise down, on-call burnout up.

Independent sources, same shape. The industry has a quantified problem.

Why alert fatigue is not a volume problem

The intuitive explanation is "too many alerts." That's wrong, or at least incomplete. If volume were the only variable, you could halve volume and halve fatigue. Teams that have tried this report it doesn't work — they cut the alert count by 50% and the on-call rotation reports no improvement.

The better framing comes from Hamza on dev.to, 10 March 2026:

"the same number of incidents can feel very different depending on who receives them and when they occur."

That sentence reframes the entire problem. The cost of an alert is not the time spent reading it. It is the context-switch cost — the mental work of stopping what you were doing, evaluating, deciding (act / dismiss / wait), and resuming. That cost compounds non-linearly with frequency. Ten alerts in an hour is not 10x the cost of one alert; it's closer to 30x, because each interruption reduces your ability to recover from the next one.

Incident Copilot's phrase, "cognitive fragmentation that accumulates around repeated low-grade interruptions" (cited across the on-call research community), is the more precise term. Fragmentation is the right word. You're not losing focus once. You're losing it in pieces, and the pieces don't reassemble.

The Google SRE rule — urgent, actionable, imminent

Google's SRE book defines the canonical test for "should this be an alert?" Three conditions, all required:

  1. Urgent. A human must respond now, not in an hour.
  2. Actionable. There is something the human can do. If the action is "wait until it self-resolves," it shouldn't have paged.
  3. Imminent. The condition affects users now, or will affect them imminently. Not "might affect them in three days if nothing changes."

If any of the three is false, it's not an alert. It might be a ticket, or a dashboard signal, or an email digest. It's not a page.
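The three-condition test reads naturally as a gate in front of the pager. A minimal sketch in Python (the `Signal` type and `route` function are illustrative, not any particular paging API):

```python
from dataclasses import dataclass

@dataclass
class Signal:
    name: str
    urgent: bool      # a human must respond now, not in an hour
    actionable: bool  # there is something the human can do
    imminent: bool    # users are affected now, or very soon

def route(sig: Signal) -> str:
    """Page only when all three conditions hold; otherwise demote."""
    if sig.urgent and sig.actionable and sig.imminent:
        return "page"
    return "ticket"  # or a dashboard signal, or an email digest

# A disk-pressure warning that isn't urgent fails the gate:
route(Signal("disk > 85%", urgent=False, actionable=True, imminent=False))  # -> "ticket"
```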

This rule is restated across the on-call research literature — Rootly's piece on managing alert fatigue, FireHydrant's alert fatigue dilemma post, and the DevOps.com on-call rotation best practices guide — but the violation is everywhere. Most production alerting setups today fail at least one of the three conditions for the majority of alerts they fire.

Why most vendors ship the wrong defaults

Here's the structural problem. Most observability vendors ship default alert rules that fire on anomalies, not on user impact. CPU > 80%. Disk > 85%. Memory pressure detected. Error rate up 2σ.

Each of these is a perfectly fine signal. None of them is reliably urgent + actionable + imminent. CPU at 80% might mean the workload is healthy and saturating. Disk at 85% might mean a logrotate is about to run. Error rate up 2σ might mean a single client is retrying.

Why do vendors ship these as defaults? Because the alternative — ship no defaults and require the customer to write their own — leaves the customer with a quiet observability product on day one, which feels broken. So vendors ship defaults that fire frequently enough to "feel like they're working." The cost of that feels-correct posture is paid by the on-call engineer six months later, when the team has 50+ rules nobody owns and 95% noise.

We unpack the five defaults that should be on out of the box (and the dozens that shouldn't) in tuned-by-default — the alerting defaults most vendors skip.

The research that doesn't get cited enough

Three sources beyond the canon are worth reading:

Rootly's "managing alert fatigue — what I wish I knew when starting as an SRE." Practitioner-written, full of "I thought X, then learned Y" moments. The most useful framing in here: an alert that fires more than once and is dismissed both times must be auto-resolved, not re-fired.

FireHydrant's "the alert fatigue dilemma." Argues that the on-call industry has been measuring the wrong thing — MTTR — and should be measuring the cost-per-alert including cognitive load. The argument generalises beyond their product.

Hamza's "on-call burnout — what incident data doesn't show." The dev.to piece quoted above. The key insight: the same number of incidents feels very different depending on context. A 3am page that turned out to be nothing has a different psychological footprint from the same page during a focus block in the afternoon.

Runframe's State of Incident Management 2025/2026. Industry survey. Confirms the alert-fatigue numbers and adds breakdowns by team size and on-call rotation length.

Tuned-by-default — the principles

Here's the principles version of what alerting should look like out of the box. We'll cover the concrete defaults in the tuned-by-default companion post.

1. Alert on user impact, not on internal signals.

The right alert is "checkout success rate dropped below 99% for 5 minutes." The wrong alert is "CPU on app-server-3 is high."

Internal signals belong on dashboards. They tell you why during incident triage. They don't justify a page on their own.
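As a concrete instance of the principle, a user-impact rule is a predicate over an SLI, not over a host metric. A hypothetical sketch (the 99% threshold and 5-minute window come from the example above):

```python
def checkout_page(success_ratio_5m: float) -> bool:
    """Page when checkout success over the last 5 minutes drops
    below 99%. CPU/memory/disk stay on dashboards for triage."""
    return success_ratio_5m < 0.99
```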

2. Use multi-window multi-burn-rate for SLOs.

Google's SRE workbook describes this pattern. Alert when the error budget is being burned at a rate that, if continued, would exhaust the budget in a short window. This catches both fast incidents (high burn rate, short window) and slow degradations (lower burn rate, longer window) without firing on every minor blip.
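A sketch of the pattern, assuming the workbook's commonly cited thresholds for a 30-day, 99.9% SLO (14.4x burn over 1h confirmed by the 5m window for fast incidents; 6x over 6h confirmed by the 30m window for slow degradations):

```python
def burn_rate(error_ratio: float, slo: float = 0.999) -> float:
    """How fast the error budget is being consumed in a window.
    error_ratio: fraction of failed requests in that window."""
    return error_ratio / (1.0 - slo)

def should_page(r_5m: float, r_1h: float, r_30m: float, r_6h: float,
                slo: float = 0.999) -> bool:
    """Multi-window multi-burn-rate: the short window must confirm
    the long one, so a single blip never pages on its own."""
    fast = burn_rate(r_1h, slo) >= 14.4 and burn_rate(r_5m, slo) >= 14.4
    slow = burn_rate(r_6h, slo) >= 6.0 and burn_rate(r_30m, slo) >= 6.0
    return fast or slow
```

At 14.4x burn, a 30-day error budget would be exhausted in about two days, which is why it justifies a page; the 6x tier catches the slow leak that would drain it in five.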

3. Auto-resolve.

Every alert must have a resolved condition. If the metric returns to healthy, the alert closes itself. The on-call engineer doesn't ack it manually. Anything else is noise tax.

4. Group related alerts.

If "checkout error rate" and "payment-service error rate" both fire within 60 seconds, they are the same incident, paged once. Most paging systems support grouping rules; most teams haven't configured them.
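If your paging system doesn't expose grouping rules, the logic is small enough to sit in a webhook. A sketch, assuming a time-sorted stream of (timestamp, rule-name) pairs and the 60-second window from the example:

```python
from datetime import datetime, timedelta

def group_alerts(alerts: list[tuple[datetime, str]], window_s: int = 60):
    """Collapse alerts arriving within window_s of the previous one
    into a single incident, so correlated failures page once."""
    incidents: list[dict] = []
    for ts, name in alerts:
        if incidents and (ts - incidents[-1]["last"]).total_seconds() <= window_s:
            incidents[-1]["alerts"].append(name)  # same incident
            incidents[-1]["last"] = ts
        else:
            incidents.append({"alerts": [name], "last": ts})
    return incidents
```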

5. Severity tiers — with real differences.

P1 pages on-call. P2 emails the team. P3 creates a ticket. The mistake is making P2 also page on-call "just in case." That's where 70% of alert volume comes from.

6. Time-of-day awareness.

A non-customer-facing maintenance task that fails at 3am can wait until 9am. The system should know that. PagerDuty's research makes this point explicitly.
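The check itself is one predicate. A sketch, assuming 09:00–18:00 local business hours (the hours are placeholders; paging tools typically express this as routing rules rather than code):

```python
from datetime import datetime

def page_now(customer_facing: bool, now: datetime) -> bool:
    """Customer-facing failures always page; everything else queues
    until business hours instead of waking someone at 3am."""
    in_hours = 9 <= now.hour < 18
    return customer_facing or in_hours
```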

7. Alert ownership is non-optional.

Every alert must have a service owner. An alert with no owner is dropped from the rotation by default after 30 days. Ownership decay is the silent killer of healthy alerting.

What this looks like in Sutrace

Sutrace ships with the seven principles above as defaults, not as opt-in. Specifically:

  • All packaged alert rules are user-impact rules. CPU/memory/disk thresholds are dashboard signals, not paging rules.
  • SLO alerts are multi-window multi-burn-rate by default, with windows configured per service and reasonable seed values.
  • Auto-resolve is on by default for every rule.
  • Grouping is on by default with a 60-second window and a service-correlation rule.
  • Severity tiers are real — P1 pages, P2 sends Slack, P3 creates a ticket. Changing the tier of an alert changes the routing without further config.
  • Time-of-day is configurable per route, with sensible regional defaults.
  • Every alert requires an owner before it can be enabled.

We documented the architectural side in the after-hours interruption load post and the OTel backend that this rides on in the OTel backend use-case page.

What to do this week

Three concrete actions that don't require a vendor change:

1. Print last week's alerts. Literally — open a CSV. For each alert, mark whether it was urgent, actionable, and imminent. The percentage that fail any of the three is your noise rate. A healthy team is under 10%; most teams are at 50–80%.

2. Find the rules that fired more than 20 times last month and were ignored. These are pure noise. Delete them. Yes, delete them. If the absence of the rule causes a real incident later, you'll add it back smarter.

3. Add an owner to every rule. A literal name in a literal field. Rules without owners get deleted in 30 days. The exercise of finding owners surfaces orphaned services.
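Steps 1 and 2 are one short script over the exported CSV. A sketch, assuming hypothetical column names (rule, urgent, actionable, imminent, acked, all yes/no); adapt them to whatever your paging tool exports:

```python
import csv
from collections import Counter

def audit(path: str) -> tuple[float, list[str]]:
    """Return (noise_rate, rules_to_delete) for last week's alerts."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    # Step 1: an alert failing any of urgent/actionable/imminent is noise.
    noisy = [r for r in rows
             if "no" in (r["urgent"], r["actionable"], r["imminent"])]
    noise_rate = len(noisy) / len(rows)
    # Step 2: rules that fired more than 20 times and were never acted on.
    fired = Counter(r["rule"] for r in rows)
    ignored = Counter(r["rule"] for r in rows if r["acked"] == "no")
    to_delete = [rule for rule, n in fired.items()
                 if n > 20 and ignored[rule] == n]
    return noise_rate, to_delete
```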

The tuned-by-default companion post has the rule-by-rule version of this audit.

A note on AI-driven triage

The OneUptime piece discusses AI as a triage layer in front of human on-call. It's a real direction and we think there's substance there. But — and this is the substance of the disagreement — AI triage on top of bad defaults is the same as bad defaults with extra latency. The right order is: fix the defaults, then add AI as the triage layer that pattern-matches across historically-related alerts. AI doesn't substitute for the urgent / actionable / imminent test; it operates after that test has been passed.

Closing

Alert fatigue is the most studied unsolved problem in SRE. The research converges on a clean answer — Google's three-condition rule plus seven concrete principles — and most observability tooling ships defaults that violate it.

The on-call engineer's attention is the scarcest resource in any production system. Cognitive fragmentation, accumulating one low-grade interruption at a time, costs more than any single dropped page. The fix is fewer, better alerts — and tooling that defaults to "fewer, better" rather than "alert on everything."

If you're picking observability tooling and on-call quality matters, the tuned-by-default companion post and the after-hours interruption load post are the next two reads. The Sutrace pricing page covers what's bundled. We'd rather you stay on your current tool with better defaults than switch to us with the same defaults you have today.

Quiet on-call rotations are the goal. Everything else is downstream.