After-hours interruption load — the statistic PagerDuty doesn't publish

50 alerts a week. 2–5% actionable. 71% of SREs respond to dozens or hundreds of un-ticketed incidents per month. The off-hours dimension of alert fatigue, and why ship-faster culture compounds it.

By Akshay Sarode · April 8, 2026 · 10 min read
Tags: alerting, on-call, sre, pagerduty


TL;DR. The canonical PagerDuty stat — 50 alerts/week with 2–5% actionable — has been the headline number for alert-fatigue research for a decade. The number it conceals is when those alerts arrive. Independent surveys (the 2026 Runframe State of Incident Management, the DevOps.com on-call rotation research) put the share of after-hours alerts at roughly half, and 71% of SREs report responding to dozens or hundreds of un-ticketed incidents per month — the alerts that never made it into the post-incident report because they self-resolved before anyone wrote them up. Sylvain Kalache's 15 March 2026 framing ties the dynamic to ship-faster culture: the same teams that are praised for shipping velocity carry the on-call burnout that velocity creates. This post is the off-hours dimension that the alert-fatigue pillar post didn't fully cover.

If you've been on-call this month and you're tired in a way you can't quite explain, this post is for you.

The number behind the number

PagerDuty's alert-fatigue learn page is the most-cited source in the literature. The two numbers from it that everyone quotes:

  • ~50 alerts per week per on-call engineer
  • 2–5% of those alerts are actionable

What PagerDuty's own page doesn't cleanly publish — and what the broader research community has been documenting independently — is the time-of-day distribution. The number that matters is the share of those 50 alerts that arrive during sleep hours.

The Atlassian on-call alert-fatigue page and the DevOps.com on-call rotation guide point at the same pattern: roughly half of alerts arrive outside business hours, with a non-trivial cluster in the 11pm–6am window. That makes intuitive sense — most production traffic is during business hours, but most automated jobs (backups, replications, ETLs) run overnight, and most network operations (DNS changes, certificate rotations, edge config pushes) happen in low-traffic windows.

So the median on-call engineer's week is something like: 25 alerts during business hours (annoying, breaks focus, impacts work), 25 alerts during off-hours (some during dinner, some during the kid's bedtime, some at 3am). The 3am ones disproportionately drive burnout.
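
You don't need a vendor feature to see your own split: a short script over a page-timestamp export is enough. Here is a minimal sketch in Python, assuming you can export page timestamps from your paging tool; the bucket boundaries are illustrative, not a standard, so adjust them to your team's hours.

    # Minimal sketch: bucket a week of page timestamps by hour of day.
    # Bucket boundaries below are assumptions, not a standard.
    from collections import Counter
    from datetime import datetime

    def bucket(ts: datetime) -> str:
        h = ts.hour
        if 9 <= h < 18:
            return "business"      # 9am-6pm
        if h >= 23 or h < 6:
            return "sleep"         # 11pm-6am
        return "evening"           # everything else

    def weekly_split(timestamps: list[datetime]) -> Counter:
        return Counter(bucket(ts) for ts in timestamps)

    # Example: three pages, one of them at 3am.
    pages = [datetime(2026, 4, 6, 10, 15),
             datetime(2026, 4, 6, 19, 40),
             datetime(2026, 4, 7, 3, 5)]
    print(weekly_split(pages))   # Counter({'business': 1, 'evening': 1, 'sleep': 1})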

The 71% statistic

Rootly's piece on managing alert fatigue and the FireHydrant alert-fatigue dilemma post both reference a version of this stat: roughly 71% of SREs respond to dozens or hundreds of un-ticketed incidents per month. That word — un-ticketed — is the key.

Un-ticketed means: the alert fired, the engineer woke up, looked at it, decided it was a false positive or a flap or a self-resolved blip, and went back to sleep without filing a post-incident report. Those incidents disappear from the formal record. They don't show up in MTTR dashboards. They don't show up in incident frequency reports. They show up nowhere except in the engineer's sleep deficit.

This is the metric the industry has been systematically under-counting. If your team's MTTR is 18 minutes and you have 4 incidents a month, your dashboards say things are healthy. They probably are healthy. But the on-call engineer also responded to 80 un-ticketed buzzes that month. The on-call cost to your team is 80 interruptions, not 4.
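
Counting that 80 is mostly a set-difference exercise. A minimal sketch, assuming you can export the month's pages and the incidents that did get a report; the field names are made up for illustration, not any vendor's schema.

    # Un-ticketed pages = pages that fired minus pages linked to a post-incident report.
    # "alert_id" is an illustrative field name, not a specific vendor's schema.
    def unticketed_count(pages: list[dict], reports: list[dict]) -> int:
        reported_ids = {r["alert_id"] for r in reports if r.get("alert_id")}
        return sum(1 for p in pages if p["alert_id"] not in reported_ids)

    pages = [{"alert_id": "a1"}, {"alert_id": "a2"}, {"alert_id": "a3"}]
    reports = [{"alert_id": "a2"}]
    print(unticketed_count(pages, reports))   # 2 of 3 pages never became a report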

The Sylvain Kalache framing

From the DevOps.com piece, 15 March 2026, we lift Sylvain Kalache's framing: the cultural valorisation of shipping velocity creates the on-call burnout it doesn't pay for. Teams that ship faster touch more services, deploy more often, and create more opportunities for alerts to fire. The CEO sees the velocity dashboard. The on-call engineer sees the buzz log. The two metrics are fundamentally connected and the org chart ensures only one of them is a board-level conversation.

Kalache's argument generalises beyond his immediate context. The structural dynamic is: ship velocity creates surface area, surface area creates alert volume, alert volume falls disproportionately on the engineers who are also the ones shipping. Burnout is the bill, and it's paid in talent attrition.

The 2026 Runframe State of Incident Management confirms this with hard numbers — teams that report "high ship velocity" also report higher on-call burnout, and the correlation is significant.

What "after-hours interruption load" actually costs

There's a body of sleep-research literature on the cost of partial sleep disruption. The short version: a single 3am page that took 10 minutes to triage costs the engineer roughly 60–90 minutes of effective sleep, because of the time required to fall back asleep plus the disruption of REM cycles. Two such pages in a week is the cognitive equivalent of pulling an all-nighter, distributed across the week.
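
For a rough sense of scale, that arithmetic fits in a back-of-the-envelope estimator. The constants below are assumptions pulled from the range above, not research-grade figures; adjust them to your own experience.

    # Back-of-the-envelope estimate of effective sleep lost to a night page.
    # fall_back_asleep and rem_disruption are assumed constants, not research figures.
    def sleep_cost_minutes(triage_minutes: int,
                           fall_back_asleep: int = 25,
                           rem_disruption: int = 35) -> int:
        return triage_minutes + fall_back_asleep + rem_disruption

    print(sleep_cost_minutes(10))       # ~70 minutes for a 10-minute 3am triage
    print(2 * sleep_cost_minutes(10))   # two such pages in a week: ~140 minutes gone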

This is the part that doesn't show up in any vendor dashboard. PagerDuty knows how many pages they delivered. They don't know what those pages cost in lost sleep, missed REM cycles, and the next day's reduced cognitive performance.

The Hamza dev.to piece cited in the pillar post is the most honest articulation of this:

"the same number of incidents can feel very different depending on who receives them and when they occur."

A 50-alert week distributed evenly across business hours is annoying. A 50-alert week with 25 of them in sleep hours is the kind of week that drives the senior engineer to update their LinkedIn.

The three defaults that specifically address off-hours load

The tuned-by-default companion post covers the general defaults. Three of those defaults are specifically about off-hours:

1. Time-of-day awareness on the routing layer.

Non-customer-facing alerts that fire at 3am should wait until 9am unless they meet a specific impact threshold. A backup-job failure at 3am that doesn't affect production traffic can wait six hours. Most teams don't configure this because the routing tool doesn't make it easy. It should be the default.
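
As a sketch of what that default looks like as logic: the function and field names below are illustrative, and a real routing tool would express the same rule as escalation-policy configuration rather than code.

    # Sketch of the deferral rule: non-customer-facing alerts that fire in the
    # sleep window wait until 9am. Field names and decide() are illustrative.
    from datetime import datetime, time, timedelta

    SLEEP_START, SLEEP_END, MORNING = time(23, 0), time(6, 0), time(9, 0)

    def in_sleep_window(now: datetime) -> bool:
        t = now.time()
        return t >= SLEEP_START or t < SLEEP_END

    def decide(alert: dict, now: datetime) -> tuple[str, datetime]:
        if alert["customer_facing"] or not in_sleep_window(now):
            return ("page_now", now)
        # Defer to 9am the same morning, or the next morning if it fired before midnight.
        day = now.date() if now.time() < SLEEP_END else now.date() + timedelta(days=1)
        return ("defer", datetime.combine(day, MORNING))

    print(decide({"customer_facing": False}, datetime(2026, 4, 7, 3, 10)))
    # ('defer', datetime.datetime(2026, 4, 7, 9, 0))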

2. Severity tiers with real channel differences.

P2 in the middle of the night should not page. It should go to a Slack channel where the morning shift sees it. The mistake is treating P2 as "P1 with slightly less urgency" — that's not what severity is for.
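
A minimal sketch of severity-aware delivery, with made-up channel names; the point is that severity and hour-of-day together pick the channel, not the urgency of the wording.

    # Severity-aware delivery: P1 always pages; P2 in the sleep window goes to a
    # Slack channel for the morning shift. Channel names are made up.
    from datetime import datetime

    def delivery_channel(severity: str, fired_at: datetime) -> str:
        sleeping = fired_at.hour >= 23 or fired_at.hour < 6
        if severity == "P1":
            return "page"
        if severity == "P2" and sleeping:
            return "slack:#oncall-morning"   # seen by the morning shift, wakes nobody
        if severity == "P2":
            return "page"
        return "slack:#alerts-low"           # P3 and below never page

    print(delivery_channel("P2", datetime(2026, 4, 7, 2, 30)))   # slack:#oncall-morning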

3. Auto-suppression on rate limits.

If a single rule has fired 5 times in the last hour and the on-call engineer has acknowledged but not resolved any of them, the system should auto-suppress further pages from that rule for the next hour. The engineer is awake; they know it's flapping; further buzzes do not produce additional value. This is one of the simplest defences and almost no vendor ships it on by default.
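
The rule itself is a few lines of logic. A sketch, assuming the router keeps recent fire/ack/resolve events per rule; the event structure is illustrative.

    # Auto-suppression sketch: a rule that fired 5+ times in the last hour, with pages
    # acknowledged but none resolved, stops paging for an hour. Event dicts are illustrative.
    from datetime import datetime, timedelta

    def should_suppress(rule_events: list[dict], now: datetime) -> bool:
        window = now - timedelta(hours=1)
        recent = [e for e in rule_events if e["fired_at"] >= window]
        if len(recent) < 5:
            return False
        acked = any(e["acknowledged"] for e in recent)
        resolved = any(e["resolved"] for e in recent)
        return acked and not resolved   # the engineer is awake and knows it's flapping

    now = datetime(2026, 4, 7, 3, 40)
    events = [{"fired_at": now - timedelta(minutes=10 * i),
               "acknowledged": True, "resolved": False} for i in range(5)]
    print(should_suppress(events, now))   # True -> hold pages from this rule until 4:40am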

What we don't recommend, and what the research community is divided on, is "no pages outside business hours." That's a non-starter for any team that owns customer-facing services. The right answer is granular routing — page on customer-impact, suppress on internal-only.

What an honest on-call dashboard looks like

If you're a manager and you want to know how your team is actually doing, the dashboard you want is not MTTR. It's something like:

  • Pages per engineer per week — separated by hour-of-day buckets (business / evening / sleep).
  • Acknowledged-and-not-resolved count — flapping alerts the engineer knows about and is ignoring.
  • Median time-to-resolve, separated by hour-of-day — sleep-hour resolutions are slower because the engineer is impaired; that's the cost.
  • Un-ticketed alert count — alerts that fired and were dismissed without a post-incident report. The 71% number lives here.
  • Repeat-rule rate — the percentage of alerts that came from the top-10 noisiest rules. If this is over 50%, you have a tuning debt.
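
None of this requires exotic tooling. Here is a sketch of how four of those rows might be computed from a flat export of page events; the field names are illustrative, and median time-to-resolve by bucket is left out because it also needs resolve timestamps.

    # Sketch of an "honest on-call" summary from a flat export of page events.
    # Field names (fired_at, rule, acknowledged, resolved, ticketed) are illustrative.
    from collections import Counter

    def honest_oncall_summary(events: list[dict]) -> dict:
        def bucket(hour: int) -> str:
            if 9 <= hour < 18:
                return "business"
            return "sleep" if (hour >= 23 or hour < 6) else "evening"

        pages_by_bucket = Counter(bucket(e["fired_at"].hour) for e in events)
        acked_not_resolved = sum(1 for e in events if e["acknowledged"] and not e["resolved"])
        unticketed = sum(1 for e in events if not e["ticketed"])
        top10 = {rule for rule, _ in Counter(e["rule"] for e in events).most_common(10)}
        repeat_rate = sum(1 for e in events if e["rule"] in top10) / max(len(events), 1)

        return {
            "pages_by_bucket": dict(pages_by_bucket),
            "acked_not_resolved": acked_not_resolved,
            "unticketed": unticketed,
            "repeat_rule_rate": round(repeat_rate, 2),   # over 0.5 means tuning debt
        }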

Most observability tools today don't compute these natively. They compute MTTR and incident counts because those are the numbers their customers initially asked for. The numbers above are what you'd ask for if you'd been on-call this year.

What this looks like in Sutrace

The on-call view in Sutrace shows the dashboard above. The metrics that drive it come from the same OTel pipeline as the rest of the observability data — alerts are first-class events, and the routing layer is instrumented end-to-end. We covered the OTel side in the OTel backend use-case page.

Specifically:

  • The on-call dashboard partitions pages by hour-of-day bucket, with a "sleep deficit" estimator for the team.
  • Time-of-day routing is on by default for severities below P1.
  • Auto-suppression on rate-limited rules is on by default with a 5-fires-in-1-hour threshold.
  • The "noisiest 10 rules" report is published weekly to a Slack channel of your choice.

We don't claim to have eliminated alert fatigue. We claim to have shipped the operating manual on by default. The rest is the team's own tuning work, which we'd rather you do less of.

What to do this week

Three actions, ordered by ease:

1. Print last month's pages. For each, mark the hour-of-day. The histogram alone is informative — most teams have never looked at it.

2. For pages between 11pm and 7am, ask: which had customer-facing impact? For the rest, ask: do these need to page in real time, or can they wait for the morning shift?

3. For your three noisiest rules — the ones with the highest fire rate that produced no incident report — propose deletion or rate-limiting. Get team buy-in on a 30-day trial. The world will not end.

The tuned-by-default companion and the pillar post cover the architectural side. This post covered the time-of-day side specifically because it's the variable most teams under-measure.

Closing

The 50-alerts-a-week stat is famous because it's evocative. It conceals the variable that actually drives burnout: the hour-of-day distribution. Half of those alerts arrive off-hours, a meaningful fraction during sleep hours, and the cost of a sleep-hour alert is disproportionately higher than that of a business-hour alert.

The fix is granular routing, severity-tier-aware delivery, and auto-suppression on flapping rules — three defaults that don't ship in most observability tools and should. The pricing page covers what's bundled in our plan; the Datadog comparison and the Better Stack comparison cover the alternatives honestly.

If you remember one thing: when you next look at your team's on-call dashboard, partition the pages by hour-of-day. The chart will tell you something the MTTR number can't.