---
title: Tuned by default — the five alerting defaults most observability vendors skip
description: Why "alert on everything" is the structural cause of alert fatigue, and the five defaults that should ship in the box. With concrete config you can copy.
author: Akshay Sarode
published: 2026-02-04
updated: 2026-02-04
cluster: c8-alert-fatigue
tags: [alerting, on-call, sre, observability]
reading: 11 min
hero: A side-by-side of two YAML files — vendor default versus tuned default.
---

# Tuned by default — the five alerting defaults most observability vendors skip

**TL;DR.** Most observability vendors ship "alert on everything" defaults — CPU thresholds, disk thresholds, anomaly detection on every metric, every check ringing PagerDuty at P1. The result is the alert-fatigue cycle the [PagerDuty research](https://www.pagerduty.com/resources/digital-operations/learn/alert-fatigue/) and the [Catchpoint SRE report](https://oneuptime.com/blog/post/2026-03-05-alert-fatigue-ai-on-call/view) keep documenting — 50 alerts a week, 2-5% actionable, 70% of SREs naming it as a top-3 concern. This is a structural problem, not a customer problem. Vendors ship it because the alternative — a quiet product on day one — feels broken to the buyer. This post lists the five defaults that *should* ship in the box, with the YAML/JSON shape, the rationale per default, and the line your CFO will appreciate. Companion to the [alert-fatigue pillar post](/blog/alert-fatigue-pillar-cognitive-fragmentation) which covers the cognitive-fragmentation framing.

If you've ever silenced a vendor's default rule and felt slightly guilty, this post is your justification.

## The structural problem in one paragraph

A new buyer evaluating an observability vendor opens a free trial. On day one, the dashboards must show data and the alerts must show *something* firing; otherwise the product feels broken. So the vendor ships a starter pack of rules: CPU > 80%, disk > 85%, latency > 2σ above baseline. The buyer sees alerts, feels the product is working, and signs the contract. Six months later the team has 50+ rules, most of them descended from the starter pack, and 95% of what they fire is noise. The team then pays the alert-fatigue tax on that trial-time decision for the next three years.

The fix isn't "ship no defaults." It's "ship five defaults that survive a year of production." Below is the list.

## Default 1 — SLO burn-rate alerts on user-facing services, multi-window multi-burn-rate

This is *the* canonical Google SRE workbook pattern. The shape:

```yaml
# Burn-rate alert template — copy and customise per service
- alert: HighErrorBudgetBurn-FastBurn
  expr: |
    (
      sum(rate(http_requests_total{job="checkout",code=~"5.."}[1h]))
      /
      sum(rate(http_requests_total{job="checkout"}[1h]))
    ) > (14.4 * 0.001)
    AND
    (
      sum(rate(http_requests_total{job="checkout",code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{job="checkout"}[5m]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: page
    service: checkout
    slo: 99.9
  annotations:
    summary: "Checkout: error budget burn rate >14.4x over 1h and 5m"
    runbook: "https://internal.example.com/runbooks/checkout-error-budget"
```

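Two conditions, and both must be true. The 1-hour window confirms that the burn is sustained enough to matter (a real incident), so a 30-second blip cannot trip it. The 5-minute window confirms that the burn is still happening, so the alert clears quickly once the errors stop instead of lingering until the hour rolls over. The math: 14.4x is the burn rate that consumes 2% of a 30-day error budget in one hour. Left alone, it exhausts the whole 99.9% budget in about 50 hours (720 h / 14.4), which is squarely in "page someone now" territory.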
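
To make the burn-rate arithmetic concrete, here is a minimal Python sketch. It assumes a 30-day SLO window, which is the window the 14.4x factor is usually quoted against; the function names are ours for illustration and are not part of Prometheus.

```python
# Sketch: burn-rate arithmetic for a multi-window alert.
# Assumption (not stated in the rule itself): the SLO is measured over 30 days.

SLO_WINDOW_HOURS = 30 * 24  # assumed 30-day SLO period


def error_budget(slo_percent: float) -> float:
    """Allowed error ratio, e.g. 99.9 -> 0.001."""
    return 1 - slo_percent / 100


def hours_to_exhaustion(burn_rate: float) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    return SLO_WINDOW_HOURS / burn_rate


def budget_consumed(burn_rate: float, hours: float) -> float:
    """Fraction of the period's budget spent after `hours` at this burn rate."""
    return burn_rate * hours / SLO_WINDOW_HOURS


def should_page(long_ratio: float, short_ratio: float,
                slo_percent: float = 99.9, factor: float = 14.4) -> bool:
    """Both windows must exceed factor * budget, mirroring the rule's two clauses."""
    threshold = factor * error_budget(slo_percent)
    return long_ratio > threshold and short_ratio > threshold


if __name__ == "__main__":
    print(f"threshold error ratio: {14.4 * error_budget(99.9):.4f}")
    print(f"hours to exhaustion at 14.4x: {hours_to_exhaustion(14.4):.0f}")
    print(f"budget spent in 1h at 14.4x: {budget_consumed(14.4, 1):.0%}")
    # Errors were high over the hour but have already stopped: no page.
    print(should_page(long_ratio=0.02, short_ratio=0.0001))
    # Errors are high over the hour and still high right now: page.
    print(should_page(long_ratio=0.02, short_ratio=0.03))
```

Under these assumptions the 14.4x rate spends 2% of the budget per hour and would drain it in roughly 50 hours. Swapping in a different SLO or factor shows how the threshold in the `expr` moves.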

Why this is the right default: it measures *user impact* rather than an internal symptom. It uses two windows so it doesn't fire on noise. It has a runbook field. It has an owner via the `service` label. Every box in the [Google SRE checklist](https://sre.google/workbook/alerting-on-slos/) is ticked.

What most vendors ship instead: a threshold on `http_5xx_error_rate > 1%` with no second window. Fires on 30-second blips. Wakes someone up. Self-resolves before they ack.

> [!NOTE]
> Diagram: Two windows on a graph (1h and 5m). Two-condition alert fires only when both are above threshold; single-condition fires on every blip.

## Default 2 — Saturation alerts that watch the *rate of approach*, not the absolute value

The "disk > 85%" alert is the canonical bad alert. Disk at 85% is an information signal, not an action signal. The action signal is "disk is filling at a rate that will exhaust capacity in < N hours."

Better default:

```yaml
- alert: DiskFillRateUrgent
  expr: |
    predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0
    AND
    node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.20
  for: 15m
  labels:
    severity: page
  annotations:
    summary: "{{ $labels.instance }} root volume will fill within 4h at current rate"
    runbook: "https://internal.example.com/runbooks/disk-fill"
```

Reads as: "if linear regression of available bytes over the last 6 hours predicts we'll be at zero in 4 hours, *and* we're already below 20% free, page." The combined condition rules out short-lived growth bursts (logrotate, large DB compaction) that don't represent a real fill trend.
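
For readers who want to see the idea outside PromQL, here is a rough Python sketch of a least-squares extrapolation combined with an absolute floor. The sample data is invented, the helper names are ours, and Prometheus's own `predict_linear` may differ in details such as which timestamp it extrapolates from.

```python
# Sketch: trajectory-based disk alerting (least-squares fit plus a floor).
# All sample values below are made up for illustration.

def predict_linear(samples: list[tuple[float, float]], horizon_s: float) -> float:
    """Fit value = slope * t + intercept over (timestamp, value) samples and
    extrapolate `horizon_s` seconds past the newest sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    newest_t = samples[-1][0]
    return slope * (newest_t + horizon_s) + intercept


def disk_should_page(avail_samples: list[tuple[float, float]], size_bytes: float,
                     horizon_s: float = 4 * 3600, floor: float = 0.20) -> bool:
    """Page only if the trend reaches zero within the horizon AND
    free space is already under the floor."""
    predicted = predict_linear(avail_samples, horizon_s)
    free_ratio = avail_samples[-1][1] / size_bytes
    return predicted < 0 and free_ratio < floor


SIZE = 100e9  # hypothetical 100 GB volume, one sample per hour for 6 hours
steady = [(h * 3600, 15e9) for h in range(7)]               # flat at 15% free
filling = [(h * 3600, 18e9 - 2e9 * h) for h in range(7)]    # losing 2 GB per hour
burst = [(h * 3600, 100e9 - (70e9 / 6) * h) for h in range(7)]  # steep, but 30% free

if __name__ == "__main__":
    print(disk_should_page(steady, SIZE))   # steady-state box: no page
    print(disk_should_page(filling, SIZE))  # real fill trend under the floor: page
    print(disk_should_page(burst, SIZE))    # steep drop with headroom left: no page
```

The `steady` case is the machine a plain "disk > 85%" rule would page on; here it stays quiet because nothing is trending toward zero.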

Compare that to a vendor default of "disk > 85%, page". The vendor rule fires on every healthy machine that runs steady-state at 85% utilisation, multiple times per week. The trajectory rule fires only when something is actually wrong. The signal-to-noise difference is dramatic.

This is the principle: alert on the *trajectory*, not the absolute value. CPU sustained at 80% and predicted to saturate within N minutes is alertable. CPU at 80% for 60 seconds is not.

## Default 3 — Auto-resolve, on every rule, no exceptions

Every alert must have a resolved condition. If the metric returns to healthy, the alert closes itself. The on-call engineer doesn't ack it manually unless they want to add context.

In Prometheus this is automatic — when the condition is no longer true, the alert resolves. In Datadog and many commercial tools, "resolve when condition is false" is *opt-in* per rule. That's the configuration error responsible for half the world's stale alerts.

Concretely, in Datadog:

```yaml
alert_settings:
  notify_no_data: false        # DO NOT page when data goes silent
  no_data_timeframe: 30        # only consider missing data after 30 min
  notify_audit: false
  on_missing_data: resolve     # NOT show_no_data — this is the bug
  evaluation_delay: 60
  renotify_interval: 0         # don't spam after first page
  evaluation_window: 5m
```

The line that matters: `on_missing_data: resolve`. Many shops run with `show_no_data` instead, which keeps the alert open and forces a manual ack. That's a noise tax on every monitoring agent restart, every brief network blip, every reboot.

The principle: the on-call engineer's time is too valuable to spend acknowledging alerts that have already self-resolved. If your tool's default is anything other than auto-resolve, your tool is wrong by default.
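
A toy model makes the difference visible. The Python sketch below is not Datadog's evaluator; it only mimics the two behaviours described above, and the state names and the `run` helper are hypothetical.

```python
# Sketch: how a missing-data policy changes what on-call sees.
# Toy model of the behaviour described in the text, not any vendor's real logic.
from enum import Enum
from typing import List, Optional


class State(Enum):
    OK = "ok"
    FIRING = "firing"


def next_state(current: State, value: Optional[float], threshold: float,
               on_missing_data: str) -> State:
    """One evaluation tick. `value` is None when no sample arrived."""
    if value is None:
        # No data this tick (agent restart, network blip, reboot).
        if on_missing_data == "resolve":
            return State.OK
        # Any other policy in this toy model: leave the alert as it was.
        return current
    return State.FIRING if value > threshold else State.OK


def run(values: List[Optional[float]], threshold: float,
        on_missing_data: str) -> List[State]:
    """Replay a series of samples and return the alert state after each tick."""
    state = State.OK
    history = []
    for value in values:
        state = next_state(state, value, threshold, on_missing_data)
        history.append(state)
    return history


# A host crosses the threshold once, then its agent goes silent.
samples = [95.0, None, None, None]

if __name__ == "__main__":
    print([s.value for s in run(samples, 90.0, "resolve")])
    print([s.value for s in run(samples, 90.0, "show_no_data")])
```

With `resolve` the alert closes on the first silent tick. With the other policy it stays open until a human acks it, which is the stale-alert pattern this default exists to remove.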

## Default 4 — Grouping, with a 30-second correlation window and a service rule

When an upstream service breaks, downstream services break with a lag of seconds. Without grouping, a single incident pages on-call ten times. The third page nullifies the focus the first one demanded.

A correct default:

```yaml
# Alertmanager / Sutrace-equivalent grouping config
route:
  group_by: ['service', 'cluster']
  group_wait: 30s              # wait 30s for related alerts before paging
  group_interval: 5m           # consolidate updates every 5m
  repeat_interval: 4h          # don't re-page for 4h after first page
  receiver: 'oncall'
  routes:
    - match:
        severity: page
      receiver: 'oncall-page'
    - match:
        severity: warning
      receiver: 'oncall-slack'
```

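The two essentials: `group_wait: 30s` (wait long enough for the related alerts of one incident to arrive before the first notification goes out) and `group_by: ['service', 'cluster']` (batch alerts that share the same `service` and `cluster` labels, the most useful axes for "is this one incident or many?"). Note that alerts from different services land in different groups under this key, so a cascade across services still needs a coarser key (for example `cluster` alone) or inhibition rules to collapse into one page.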
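
The effect of the grouping key is easy to simulate. The Python sketch below is a simplified model: it batches alerts that share the same `group_by` labels and arrive within `group_wait` of the first alert in the batch. Alertmanager's real scheduling also involves `group_interval` and `repeat_interval`, which this sketch ignores; the alert names and timings are invented.

```python
# Sketch: how group_by + group_wait turn many alerts into few pages.
# Simplified model; alert names and timestamps are hypothetical.
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    service: str
    cluster: str
    fired_at: float  # seconds since the start of the incident


def group_pages(alerts, group_by=("service", "cluster"), group_wait=30.0):
    """Return one batch (one notification) per group of related alerts."""
    groups = defaultdict(list)  # group key -> list of batches
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        key = tuple(getattr(alert, label) for label in group_by)
        batches = groups[key]
        # Join the open batch if we are still inside its wait window.
        if batches and alert.fired_at - batches[-1][0].fired_at <= group_wait:
            batches[-1].append(alert)
        else:
            batches.append([alert])
    return [batch for batches in groups.values() for batch in batches]


incident = [
    Alert("HighErrorRate", "checkout", "prod-eu", 0),
    Alert("HighLatency", "checkout", "prod-eu", 12),
    Alert("QueueBacklog", "checkout", "prod-eu", 25),
    Alert("HighErrorRate", "payments", "prod-eu", 8),
]

if __name__ == "__main__":
    print(len(incident), "alerts ungrouped")
    print(len(group_pages(incident)), "pages grouped by service + cluster")
    print(len(group_pages(incident, group_by=("cluster",))), "page grouped by cluster")
```

Four raw alerts become two pages under the default key and one page when grouped by `cluster` alone, which is the trade-off between precision and quiet that the key controls.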

What most vendors ship: no grouping at all by default. Each alert is its own page. Three alerts in 90 seconds = three phone buzzes.

## Default 5 — Severity tiers with *real, different* delivery channels

Most teams have P1, P2, P3 in their alert metadata. Few teams have *different routing* for them. The result: P2 also pages, P3 also pages, and the tiers degrade into "P1, big P1, slightly less big P1."

The right default routes by severity:

| Severity | Channel | Time-of-day |
|---|---|---|
| P1 (page) | PagerDuty / Opsgenie phone | 24/7 |
| P2 (urgent but not waking) | Slack channel + email | 24/7, as a message rather than a page |
| P3 (informational) | Ticket / digest | Business hours only |
| P4 (logged) | Dashboard / weekly report | Never delivered live |

When you make the channels actually different, severity becomes a meaningful axis. A P2 doesn't wake anyone but is still tracked. A P3 doesn't generate a Slack ping. A P4 disappears into the dashboard where it belongs.
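
As a sketch of what "severity as a routing key" means in practice, the Python below encodes the table as a function. The channel names and the business-hours definition (Monday to Friday, 09:00 to 17:00 local time) are assumptions for illustration, not a vendor API.

```python
# Sketch: severity decides the channel and whether delivery happens now.
# Channel names and business hours are assumed for illustration.
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Delivery:
    channel: str
    deliver_now: bool  # False means hold (queue, digest, or dashboard only)


def in_business_hours(now: datetime) -> bool:
    """Assumed business hours: Monday to Friday, 09:00 to 17:00 local time."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def route(severity: str, now: datetime) -> Delivery:
    """Map a severity tier to a channel, following the table above."""
    if severity == "P1":
        return Delivery("pager", True)               # phone page, 24/7
    if severity == "P2":
        return Delivery("slack+email", True)         # message, never a page
    if severity == "P3":
        return Delivery("ticket", in_business_hours(now))
    if severity == "P4":
        return Delivery("dashboard", False)          # never delivered live
    raise ValueError(f"unknown severity: {severity!r}")


if __name__ == "__main__":
    night = datetime(2026, 2, 4, 3, 0)     # a Wednesday, 03:00
    morning = datetime(2026, 2, 4, 10, 0)  # the same Wednesday, 10:00
    for sev in ("P1", "P2", "P3", "P4"):
        print(sev, route(sev, night), route(sev, morning))
```

The useful property is that no tier other than P1 can ever reach the pager, regardless of how the rule author labelled the alert's urgency in prose.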

What most vendors ship: severity is a label, not a routing key. Every severity hits the same channel. Severity becomes vestigial.

## The line your CFO will appreciate

Alert fatigue isn't only a humane concern. It's a financial one. The 2026 [Runframe State of Incident Management](https://runframe.io/blog/state-of-incident-management-2025) survey, [DevOps.com's on-call rotation guide](https://devops.com/on-call-rotation-best-practices-reducing-burnout-and-improving-response/), and [Atlassian's incident-management research](https://www.atlassian.com/incident-management/on-call/alert-fatigue) all converge on the same conclusion — chronic alert fatigue increases turnover among senior engineers, who are the most expensive to replace.

The arithmetic is simple. A senior SRE costs $250–400k loaded. Replace one a year due to on-call burnout and you've spent the cost of two years' worth of any observability tool. The five defaults above are zero-marginal-cost product decisions that meaningfully reduce that risk. Vendors who don't ship them are choosing trial-time conversion over customer retention.

## Why "tuned by default" is harder for vendors than it looks

There's a real reason vendors don't ship tuned defaults. Three of them, actually:

**1. Trial-time conversion incentive.** A quiet product feels broken in week one. Loud defaults convert trials.

**2. Customer support cost.** "Why isn't this firing?" is a more common support ticket than "this is firing too much." The first ticket gets logged; the second doesn't, because the customer just silences the rule.

**3. Default tuning is opinionated.** Different teams have different SLOs. There's no single right `for: 2m` value. Vendors avoid the opinion to avoid being wrong for any specific customer.

The first two are bad reasons. The third is real but solvable — ship a small set of opinionated defaults that work for 80% of teams, plus an obvious "tune for my SLO" wizard. The hard part is the institutional courage to ship the small set rather than the kitchen-sink set.

> [!NOTE]
> Chart: Vendor-shipped default rule counts versus median-customer post-tuning rule count, on a sample of N vendors. Most ship 100+, customers keep <20.

## The Hamza framing applies here too

From the dev.to piece [on-call burnout — what incident data doesn't show](https://dev.to/hamza_2315/on-call-burnout-what-incident-data-doesnt-show-2kap): the same number of alerts feels different depending on context. Tuned defaults reduce the *count* of alerts; they also reduce the *cost-per-alert* because the alerts that remain are higher-confidence. Both axes matter. We covered the cognitive side in [the alert-fatigue pillar post](/blog/alert-fatigue-pillar-cognitive-fragmentation).

## What this looks like in Sutrace

Concretely, the five defaults are on by default in our packaged rule library:

1. **Burn-rate alerts.** Every service we auto-instrument gets a multi-window burn-rate alert if a `slo` annotation is present. Default windows: 1h+5m for fast burn, 6h+30m for slow burn. Default budgets per tier published on the [pricing page](/pricing).
2. **Saturation alerts.** Disk, memory, connection pool — all `predict_linear`-based with absolute floor. CPU is *not* alerted on by default; it's a dashboard signal.
3. **Auto-resolve.** On for every rule. There is no UI to disable it for a single rule (you'd have to reach into the API). We make it slightly inconvenient on purpose.
4. **Grouping.** 60-second window, grouped by `service` + `environment` by default. Configurable but not removable.
5. **Severity routing.** P1 pages, P2 Slacks, P3 tickets. Different channels by default. Time-of-day awareness on by default for non-P1 severities.

We don't claim this is novel — it's straight out of Google's SRE workbook and the on-call research community's consensus (Rootly, FireHydrant, Atlassian, PagerDuty, Catchpoint, the lot). What's novel is that it's *on by default*. We chose the small-set-shipped-with-courage path. If you want the kitchen-sink set, you'll have to copy them in by hand.

## The audit, again

If you're not switching tools and just want to fix what you have: the audit from the [pillar post](/blog/alert-fatigue-pillar-cognitive-fragmentation) is the next read. Three steps:

1. Print last week's alerts. Mark each as urgent / actionable / imminent — three checkboxes per row.
2. Delete every rule that fired more than 20 times last month and was ignored.
3. Add an owner to every remaining rule. No-owner rules expire after 30 days.
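
Steps 2 and 3 can be scripted once you have per-rule firing counts. The Python sketch below assumes you can export, for each rule, how often it fired last month, how many of those firings led to any action, and who owns it; the `RuleStats` shape and the sample rules are hypothetical.

```python
# Sketch: flag rules to delete and rules that still need an owner.
# The RuleStats fields and the sample data are hypothetical.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RuleStats:
    name: str
    fired_last_month: int
    acted_on: int            # firings that led to any human action
    owner: Optional[str]     # None means nobody owns the rule


def audit(rules: List[RuleStats],
          max_ignored_firings: int = 20) -> Tuple[List[str], List[str]]:
    """Return (rules to delete, surviving rules that need an owner)."""
    delete = [
        r.name for r in rules
        if r.fired_last_month > max_ignored_firings and r.acted_on == 0
    ]
    doomed = set(delete)
    needs_owner = [r.name for r in rules if r.name not in doomed and not r.owner]
    return delete, needs_owner


rules = [
    RuleStats("cpu_over_80", fired_last_month=140, acted_on=0, owner=None),
    RuleStats("checkout_fast_burn", fired_last_month=2, acted_on=2, owner="payments-team"),
    RuleStats("disk_fill_rate", fired_last_month=1, acted_on=1, owner=None),
    RuleStats("queue_depth", fired_last_month=35, acted_on=4, owner="platform"),
]

if __name__ == "__main__":
    to_delete, unowned = audit(rules)
    print("delete:", to_delete)
    print("needs owner:", unowned)
```

A rule that fires often but does get acted on, like `queue_depth` here, survives the cut; it is a candidate for tuning rather than deletion.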

The honest conversation is: most teams already know which 50% of their rules are noise. They keep them because deleting them feels risky. The risk of keeping them, integrated over a year, is higher than the risk of deleting them. Pick a Friday afternoon and do it.

## Closing

Tuned-by-default isn't a feature on a slide. It's a posture. It says: we'd rather ship a quiet product on day one and have you in a healthy place in year three than ship a loud product on day one and have you burnt out in year three. The five defaults are the operating manual.

If your current vendor ships the kitchen-sink defaults and the on-call rotation hates them, the fastest fix is the audit above. The slower fix is moving to a tool that doesn't make you re-tune from scratch — that's the [Sutrace use-case page](/use-cases/opentelemetry-backend) and the [Datadog comparison](/compare/sutrace-vs-datadog). The third post in the cluster, [after-hours interruption load](/blog/after-hours-interruption-load-the-statistic-pagerduty-doesnt-publish), covers the off-hours dimension that this post doesn't.

Five defaults. Ship the small set. Quiet wins.
