Tuned by default — the five alerting defaults most observability vendors skip
Why "alert on everything" is the structural cause of alert fatigue, and the five defaults that should ship in the box. With concrete config you can copy.
TL;DR. Most observability vendors ship "alert on everything" defaults — CPU thresholds, disk thresholds, anomaly detection on every metric, every check ringing PagerDuty at P1. The result is the alert-fatigue cycle the PagerDuty research and the Catchpoint SRE report keep documenting — 50 alerts a week, 2-5% actionable, 70% of SREs naming it as a top-3 concern. This is a structural problem, not a customer problem. Vendors ship it because the alternative — a quiet product on day one — feels broken to the buyer. This post lists the five defaults that should ship in the box, with the YAML/JSON shape, the rationale per default, and the line your CFO will appreciate. Companion to the alert-fatigue pillar post which covers the cognitive-fragmentation framing.
If you've ever silenced a vendor's default rule and felt slightly guilty, this post is your justification.
The structural problem in one paragraph
A new buyer evaluating an observability vendor opens a free trial. On day one, the dashboards must show data and the alerts must show something firing — otherwise it feels broken. So the vendor ships a starter pack of rules: CPU > 80%, disk > 85%, latency > 2σ above baseline. The buyer sees alerts, feels the product is working, signs the contract. Six months later the team has 50+ rules, most of them descended from the starter pack, and 95% noise. The trial-time decision is paying alert-fatigue tax for the next three years.
The fix isn't "ship no defaults." It's "ship five defaults that survive a year of production." Below is the list.
Default 1 — SLO burn-rate alerts on user-facing services, multi-window multi-burn-rate
This is the canonical Google SRE workbook pattern. The shape:
```yaml
# Burn-rate alert template — copy and customise per service
- alert: HighErrorBudgetBurnFast
  expr: |
    (
      sum(rate(http_requests_total{job="checkout",code=~"5.."}[1h]))
      /
      sum(rate(http_requests_total{job="checkout"}[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(http_requests_total{job="checkout",code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{job="checkout"}[5m]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: page
    service: checkout
    slo: "99.9"
  annotations:
    summary: "Checkout: error budget burn rate >14.4x over 1h and 5m"
    runbook: "https://internal.example.com/runbooks/checkout-error-budget"
```
Two conditions, both must be true. The long window catches a sustained burn (a real incident); the short window confirms the burn is still happening right now, so the alert doesn't fire for a 5-minute blip that already self-resolved. The math: at a 14.4x burn rate, the 30-day error budget of a 99.9% SLO is exhausted in about 50 hours, which is squarely in "page someone now" territory.
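The exhaustion arithmetic is worth checking yourself. A quick sketch, using the values from the rule above and the conventional 30-day SLO window:

```python
# Burn-rate arithmetic for the rule above: how long until a 99.9% SLO's
# error budget is gone at a given burn rate (30-day SLO window assumed).
SLO = 0.999
BUDGET = 1 - SLO              # 0.1% of requests may fail
WINDOW_HOURS = 30 * 24        # 720h SLO window

def hours_to_exhaustion(burn_rate: float) -> float:
    # At 1x burn the budget lasts exactly the window; at Nx it lasts window/N.
    return WINDOW_HOURS / burn_rate

print(hours_to_exhaustion(14.4))   # 50.0 hours: page-worthy fast burn
print(hours_to_exhaustion(6.0))    # 120.0 hours: slow burn, ticket territory
print(round(14.4 * BUDGET, 4))     # 0.0144, i.e. the 1.44% threshold in the expr
```

Changing the SLO or window changes the burn-rate multipliers, which is exactly why a "tune for my SLO" knob matters.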
Why this is the right default: it's user-impact rather than internal. It uses two windows so it doesn't fire on noise. It has a runbook field. It has an owner via the service label. Every box in the Google SRE checklist is ticked.
What most vendors ship instead: a threshold on http_5xx_error_rate > 1% with no second window. Fires on 30-second blips. Wakes someone up. Self-resolves before they ack.
Default 2 — Saturation alerts that watch the rate of approach, not the absolute value
The "disk > 85%" alert is the canonical bad alert. Disk at 85% is an information signal, not an action signal. The action signal is "disk is filling at a rate that will exhaust capacity in < N hours."
Better default:
```yaml
- alert: DiskFillRateUrgent
  expr: |
    predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0
    and
    node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.20
  for: 15m
  labels:
    severity: page
  annotations:
    summary: "{{ $labels.instance }} root volume will fill within 4h at current rate"
    runbook: "https://internal.example.com/runbooks/disk-fill"
```
Reads as: "if linear regression of available bytes over the last 6 hours predicts we'll be at zero in 4 hours, and we're already below 20% free, page." The combined condition rules out short-lived growth bursts (logrotate, large DB compaction) that don't represent a real fill trend.
Compare to a vendor default of "disk > 85% page": the latter fires on every healthy machine that runs steady-state at 85% utilisation, multiple times per week. The first one fires only when something is actually wrong. The signal-to-noise difference is dramatic.
This is the principle: alert on the trajectory, not the absolute. CPU at 80% sustained with predicted-saturated-in-N-minutes is alertable. CPU at 80% for 60 seconds is not.
Default 3 — Auto-resolve, on every rule, no exceptions
Every alert must have a resolved condition. If the metric returns to healthy, the alert closes itself. The on-call engineer doesn't ack it manually unless they want to add context.
In Prometheus this is automatic — when the condition is no longer true, the alert resolves. In Datadog and many commercial tools, "resolve when condition is false" is opt-in per rule. That's the configuration error responsible for half the world's stale alerts.
Concretely, in Datadog:
```yaml
alert_settings:
  notify_no_data: false    # DO NOT page when data goes silent
  no_data_timeframe: 30    # only consider missing data after 30 min
  notify_audit: false
  on_missing_data: resolve # NOT show_no_data — this is the bug
  evaluation_delay: 60
  renotify_interval: 0     # don't spam after first page
  evaluation_window: 5m
```
The line that matters: `on_missing_data: resolve`. The vendor default in many shops is `show_no_data`, which keeps the alert open and forces a manual ack. That's a noise tax on every monitoring-agent restart, every brief network blip, every reboot.
The principle: the on-call engineer's time is too valuable to spend acknowledging alerts that have already self-resolved. If your tool's default is anything other than auto-resolve, your tool is wrong by default.
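The contract is small enough to state as code. A toy sketch (names illustrative): each evaluation cycle, the set of currently-firing conditions fully determines which alerts open and which close, with no ack anywhere in the loop.

```python
# Toy model of auto-resolve: alert state is a pure function of the
# currently-firing conditions; nothing stays open waiting for a manual ack.
def evaluate(open_alerts: set, firing_now: set):
    """Return (newly_opened, auto_resolved) for one evaluation cycle."""
    newly_opened = firing_now - open_alerts
    auto_resolved = open_alerts - firing_now  # condition cleared -> closes itself
    return newly_opened, auto_resolved

opened, resolved = evaluate({"disk-fill", "burn-rate"}, {"burn-rate"})
print(opened)    # set(): nothing new fired
print(resolved)  # {'disk-fill'}: self-resolved, no one had to ack it
```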
Default 4 — Grouping, with a 30-second correlation window and a service rule
When an upstream service breaks, downstream services break with a lag of seconds. Without grouping, a single incident pages on-call ten times. The third page nullifies the focus the first one demanded.
A correct default:
```yaml
# Alertmanager / Sutrace-equivalent grouping config
route:
  group_by: ['service', 'cluster']
  group_wait: 30s        # wait 30s for related alerts before paging
  group_interval: 5m     # consolidate updates every 5m
  repeat_interval: 4h    # don't re-page for 4h after first page
  receiver: 'oncall'
  routes:
    - match:
        severity: page
      receiver: 'oncall-page'
    - match:
        severity: warning
      receiver: 'oncall-slack'
```
The two essentials: `group_wait: 30s` (wait long enough to consolidate the upstream and downstream alerts of one incident into a single notification) and `group_by: ['service', 'cluster']` (alerts with the same service and cluster values are treated as one incident — the most useful axes for "is this one incident or many?").
What most vendors ship: no grouping at all by default. Each alert is its own page. Three alerts in 90 seconds = three phone buzzes.
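A simplified model of what group_wait buys you, assuming each alert carries an arrival time and a (service, cluster) key. Real Alertmanager behaviour with group_interval and repeat_interval is richer than this sketch:

```python
# Simplified grouping model: alerts for the same (service, cluster) key that
# arrive within GROUP_WAIT of the group's first alert share a single page.
GROUP_WAIT = 30  # seconds

def page_count(alerts):
    """alerts: list of (arrival_seconds, service, cluster). Returns pages sent."""
    group_started = {}
    pages = 0
    for t, service, cluster in sorted(alerts):
        key = (service, cluster)
        if key not in group_started or t - group_started[key] > GROUP_WAIT:
            group_started[key] = t
            pages += 1  # first alert of a new group -> one page after the wait
    return pages

# Three alerts from one failing service within 20s: one page, not three.
print(page_count([(0, "checkout", "eu1"), (5, "checkout", "eu1"), (20, "checkout", "eu1")]))  # 1
# The same service failing in two clusters is plausibly two incidents: two pages.
print(page_count([(0, "checkout", "eu1"), (5, "checkout", "us1")]))  # 2
```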
Default 5 — Severity tiers with real, different delivery channels
Most teams have P1, P2, P3 in their alert metadata. Few teams have different routing for them. The result: P2 also pages, P3 also pages, and the tiers degrade into "P1, big P1, slightly less big P1."
The right default routes by severity:
| Severity | Channel | Time-of-day |
|---|---|---|
| P1 (page) | PagerDuty / Opsgenie phone | 24/7 |
| P2 (urgent but not waking) | Slack channel + email | 24/7, but as message not page |
| P3 (informational) | Ticket / digest | Business hours only |
| P4 (logged) | Dashboard / weekly report | Never delivered live |
When you make the channels actually different, severity becomes a meaningful axis. A P2 doesn't wake anyone but is still tracked. A P3 doesn't generate a Slack ping. A P4 disappears into the dashboard where it belongs.
What most vendors ship: severity is a label, not a routing key. Every severity hits the same channel. Severity becomes vestigial.
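The table above, expressed as the routing structure it should be: severity as a routing key, not a label. Channel names are illustrative.

```python
# Severity routes as data: each tier gets a genuinely different channel.
ROUTES = {
    "P1": {"channel": "pagerduty-phone", "hours": "24/7",       "wakes_humans": True},
    "P2": {"channel": "slack+email",     "hours": "24/7",       "wakes_humans": False},
    "P3": {"channel": "ticket-digest",   "hours": "business",   "wakes_humans": False},
    "P4": {"channel": "dashboard",       "hours": "never-live", "wakes_humans": False},
}

def channel_for(severity: str) -> str:
    return ROUTES[severity]["channel"]  # unknown severity raises KeyError, by design

print(channel_for("P2"))  # slack+email: tracked, but nobody's phone buzzes
```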
The line your CFO will appreciate
Alert fatigue isn't only a humane concern. It's a financial one. The 2026 Runframe State of Incident Management survey, DevOps.com's on-call rotation guide, and Atlassian's incident-management research all converge on the same conclusion — chronic alert fatigue increases turnover among senior engineers, who are the most expensive to replace.
The arithmetic is simple. A senior SRE costs $250–400k loaded. Replace one a year due to on-call burnout and you've spent the cost of two years' worth of any observability tool. The five defaults above are zero-marginal-cost product decisions that meaningfully reduce that risk. Vendors who don't ship them are choosing trial-time conversion over customer retention.
Why "tuned by default" is harder for vendors than it looks
There's a real reason vendors don't ship tuned defaults. Three of them, actually:
1. Trial-time conversion incentive. A quiet product feels broken in week one. Loud defaults convert trials.
2. Customer support cost. "Why isn't this firing?" is a more common support ticket than "this is firing too much." The first ticket gets logged; the second doesn't, because the customer just silences the rule.
3. Default tuning is opinionated. Different teams have different SLOs. There's no single right `for: 2m` value. Vendors avoid the opinion to avoid being wrong for any specific customer.
The first two are bad reasons. The third is real but solvable — ship a small set of opinionated defaults that work for 80% of teams, plus an obvious "tune for my SLO" wizard. The hard part is the institutional courage to ship the small set rather than the kitchen-sink set.
The Hamza framing applies here too
From the dev.to piece "On-call burnout — what incident data doesn't show": the same number of alerts feels different by context. Tuned defaults reduce the count of alerts; they also reduce the cost per alert, because the alerts that remain are higher-confidence. Both axes matter. We covered the cognitive side in the alert-fatigue pillar post.
What this looks like in Sutrace
Concretely, the five defaults are on by default in our packaged rule library:
- Burn-rate alerts. Every service we auto-instrument gets a multi-window burn-rate alert if a `slo` annotation is present. Default windows: 1h+5m for fast burn, 6h+30m for slow burn. Default budgets per tier published on the pricing page.
- Saturation alerts. Disk, memory, connection pool — all `predict_linear`-based with an absolute floor. CPU is not alerted on by default; it's a dashboard signal.
- Auto-resolve. On for every rule. There is no UI to disable it for a single rule (you'd have to reach into the API). We make it slightly inconvenient on purpose.
- Grouping. 60-second window, grouped by `service` + `environment` by default. Configurable but not removable.
- Severity routing. P1 pages, P2 Slacks, P3 tickets. Different channels by default. Time-of-day awareness on by default for non-P1 severities.
We don't claim this is novel — it's straight out of Google's SRE workbook and the on-call research community's consensus (Rootly, FireHydrant, Atlassian, PagerDuty, Catchpoint, the lot). What's novel is that it's on by default. We chose the small-set-shipped-with-courage path. If you want the kitchen-sink set, you'll have to copy them in by hand.
The audit, again
If you're not switching tools and just want to fix what you have: the audit from the pillar post is the next read. Three steps:
- Print last week's alerts. Mark each as urgent / actionable / imminent — three checkboxes per row.
- Delete every rule that fired more than 20 times last month and was ignored.
- Add an owner to every remaining rule. No-owner rules expire after 30 days.
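Step 2 is scriptable against whatever export your tool gives you. A sketch, assuming a CSV with rule, fired_count, and ack_count columns (the column names and CSV shape are hypothetical; adapt to your tool's export):

```python
# Flag deletion candidates: rules that fired more than 20 times last month
# and were never acknowledged. CSV shape and column names are assumptions.
import csv
import io

EXPORT = """rule,fired_count,ack_count
disk-85-percent,44,0
checkout-burn-rate,3,3
cpu-80-percent,61,0
"""

def deletion_candidates(csv_text: str) -> list:
    rows = csv.DictReader(io.StringIO(csv_text))
    return [r["rule"] for r in rows
            if int(r["fired_count"]) > 20 and int(r["ack_count"]) == 0]

print(deletion_candidates(EXPORT))  # ['disk-85-percent', 'cpu-80-percent']
```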
The honest conversation is: most teams already know which 50% of their rules are noise. They keep them because deleting them feels risky. The risk of keeping them, integrated over a year, is higher than the risk of deleting them. Pick a Friday afternoon and do it.
Closing
Tuned-by-default isn't a feature on a slide. It's a posture. It says: we'd rather ship a quiet product on day one and have you in a healthy place on year three, than ship a loud product on day one and have you burnt out on year three. The five defaults are the operating manual.
If your current vendor ships the kitchen-sink defaults and the on-call rotation hates them, the fastest fix is the audit above. The slower fix is moving to a tool that doesn't make you re-tune from scratch — that's the Sutrace use-case page and the Datadog comparison. The third post in the cluster, after-hours interruption load, covers the off-hours dimension that this post doesn't.
Five defaults. Ship the small set. Quiet wins.