Prometheus at scale, the Cloudflare 200-sample rule, and the day you graduate

Why a single Prometheus instance hits a wall around 1–2M active series, what Cloudflare's sample_limit:200 actually defends against, and how to know when you've graduated to Thanos / Cortex / Mimir.

By Akshay Sarode · September 19, 2025 · 13 min read · prometheus, observability, cardinality, thanos


TL;DR. Single-instance Prometheus is the right answer for most teams up to about 1–2 million active time-series. Beyond that, query latency degrades sharply and you start federating, sharding, or moving to Thanos / Cortex / Mimir. Cloudflare's sample_limit: 200 per-target setting is the single most underrated defence in the entire stack — it stops one bad exporter from poisoning the cluster. This post covers the ceiling, the limit, the cognitive-load tax of multi-instance Prometheus (Chronosphere's "which instance do I query?" problem), and the exact signals that tell you it's time to graduate. We end with a decision tree: stay on Prometheus, scale Prometheus, or move to OTLP-native storage. If the bill or the latency is starting to bite, this is the post to read first.

The single-instance ceiling, in numbers

Three independent sources put the ceiling in the same range: roughly one to two million active series on a single well-provisioned instance.

Two million series sounds like a lot until you do the multiplication. Take a Kubernetes cluster with 500 pods, each emitting 100 metrics with roughly 5 distinct label combinations apiece. That's 500 × 100 × 5 = 250,000 series per environment from just the standard kube-state stuff. Multiply by four environments, then add churn from labels like pod_name (which creates a fresh set of series on every deploy), and the total is comfortably past the ceiling within a year.
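The multiplication above, as a back-of-envelope script (the numbers are the illustrative ones from this example, not measurements from a real cluster):

```python
# Back-of-envelope active-series estimate for the example cluster above.
pods = 500
metrics_per_pod = 100
series_per_metric = 5  # distinct label-value combinations per metric

series_per_env = pods * metrics_per_pod * series_per_metric
envs = 4
total_series = series_per_env * envs

print(f"{series_per_env:,} series per environment")
print(f"{total_series:,} series across {envs} environments")
# Note: pod_name churn is not modelled here -- every deploy replaces
# pods, so stale series linger for the retention window and push the
# real number higher still.
```

Even before deploy churn, the steady-state number sits at the bottom edge of the 1–2M ceiling.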

Cloudflare's sample_limit: 200 rule

Cloudflare runs Prometheus across thousands of nodes. Their Prometheus-at-scale post describes the operational pattern, and the most important single line of YAML in the entire post is this one:

scrape_configs:
  - job_name: 'application'
    sample_limit: 200
    static_configs:
      - targets: ['app-1:9100', 'app-2:9100']

sample_limit: 200 means: if a scrape produces more than 200 samples, the scrape is rejected entirely. The scrape is recorded as a failure (you can alert on that) and no data is written to the TSDB.

Why 200? Because Cloudflare's apps are written by humans who occasionally write a histogram bucket configuration that explodes. One developer adds five new labels to a metric, and the next scrape is 50,000 samples instead of 50. Without sample_limit, that scrape lands, the WAL bloats, query latency tanks across the cluster, and the on-call team spends an evening backing it out. With sample_limit: 200, the scrape is rejected, an alert fires, and the developer reverts the change before it leaves staging.

The number doesn't have to be 200 — it's a per-application choice. The pattern is: every scrape config gets a sample_limit. The default in the upstream Prometheus config is "no limit," and that default is a footgun.
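To make the rejection visible, you can alert on Prometheus's own self-monitoring counter for scrapes rejected by sample_limit. A sketch of a rule-file fragment (the alert name and threshold are examples; tune to taste):

```yaml
# Alerting rule fragment: fire when any target's scrape is being
# rejected for exceeding sample_limit.
groups:
  - name: scrape-limits
    rules:
      - alert: ScrapeSampleLimitExceeded
        expr: rate(prometheus_target_scrapes_exceeded_sample_limit_total[5m]) > 0
        labels:
          severity: page
        annotations:
          summary: "A target exceeded sample_limit and its scrapes are being dropped"
```

This is the alert that turns "the scrape is recorded as a failure" into a page before the bad change leaves staging.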

Three other defences worth knowing

sample_limit is the headliner. Three siblings are worth the same attention:

label_limit. Per-scrape maximum number of labels on any single sample. Default is 0 (unbounded). Set it to something like 32 — anything more is almost always a config error.

label_value_length_limit. Per-label-value max length in bytes. Default 0. Set it to 200. URLs and JSON-as-label-value are the usual offenders.

target_limit. Per-scrape-config maximum number of targets. Stops a runaway service-discovery config from suddenly creating 10,000 targets after a misconfigured Kubernetes label.

Together, these four limits constitute a defence-in-depth posture. None of them are on by default. All of them should be on for any team larger than five engineers.
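All four limits live in the same scrape config. A sketch with example values (tune each number to roughly 50% headroom over your largest legitimate scrape; the target_limit of 500 is illustrative):

```yaml
# Defence-in-depth: all four per-scrape limits on one job.
scrape_configs:
  - job_name: 'application'
    sample_limit: 200              # reject scrapes producing > 200 samples
    label_limit: 32                # max labels on any single sample
    label_value_length_limit: 200  # max label-value length in bytes
    target_limit: 500              # cap targets from service discovery
    static_configs:
      - targets: ['app-1:9100', 'app-2:9100']
```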

What single-instance failure looks like

The failure mode isn't "Prometheus crashes." It's worse — Prometheus continues to run, but everything around it degrades:

  1. Query latency p99 climbs. A query that took 2s now takes 30s.
  2. WAL replay time on restart climbs. A pod restart that took 90s now takes 25 minutes.
  3. OOM kills. The cgroup memory limit is exceeded mid-query, the pod dies, the WAL replays again.
  4. Alerting backs up. Recording rules stop catching up, alerts fire late, on-call debugs the wrong window.

The signal is usually a slow Grafana dashboard. By the time someone notices the dashboard, the cluster has been degraded for a week.

The Chronosphere quote — cognitive-load tax of multi-instance

The hardest part of scaling Prometheus is not technical. It's organisational. From Chronosphere's article on Prometheus scaling challenges:

Each time a user runs a query, they must first remember which instance to query their data from.

Read that twice. In a federated Prometheus setup, your users (engineers) carry the mental model of "metrics for service A live on instance 3, metrics for service B live on instance 7." That model is inherently lossy — they'll forget, they'll guess wrong, and the tooling won't correct them. The cognitive-load tax compounds with every new service and every new instance.

This is the real reason teams graduate to Thanos / Cortex / Mimir / managed: not because Prometheus can't store the data, but because asking engineers to remember the data layout is a tax that doesn't show up on any spreadsheet but slowly erodes velocity.

Graduating — the four real options

When you outgrow single-instance, the choices are:

Option 1: Federation

Multiple Prometheus instances, each with its own targets, plus a "federator" that scrapes summaries from each. Cheap to set up, severely limited query expressiveness (only the federated metrics are available globally), and the cognitive-load tax is worse because now your engineers must also remember which metrics are federated.

When this wins: small teams that need a quick stopgap and have a small set of "platform" metrics to federate.

Option 2: Thanos

Sidecar pattern — each Prometheus has a Thanos sidecar that ships TSDB blocks to object storage. Querier component fans out across all Prometheuses for queries that hit the recent window, and reads from object storage for older data. Thanos docs cover the architecture in detail.

When this wins: teams already deep in Prometheus, with strong S3 / GCS / MinIO posture, and a willingness to operate the sidecars + querier + compactor + store-gateway. Operationally heavy.

Option 3: Cortex / Mimir

Multi-tenant horizontally-scalable Prometheus, originally Cortex, then Mimir as the Grafana fork. Push-based ingestion (Prometheus remote_write) into a distributed write path with hash-ring sharding. Grafana Mimir docs are the reference. Mimir is the gold standard for "we have hundreds of Prometheus sources and want a single global query layer."

When this wins: large platform teams with the operational appetite to run a sharded distributed system. Not a small-team choice.

Option 4: OTLP-native storage

Ditch the Prometheus-as-storage assumption. Keep Prometheus as a scraper (or migrate scrapes to the OTel Collector's prometheusreceiver), and write into OTLP-native storage like ClickHouse. We documented the architecture in OTel Collector to ClickHouse — a quickstart you can run in an hour and the OTel-backend landscape in the protocol-war pillar.

When this wins: teams that want to converge on OTLP as the universal protocol, want a single storage layer for traces+metrics+logs, and want to leave the Prometheus operational footprint behind.

The decision tree

If you're staring at a slow Grafana dashboard right now:

1. Are you under 1M active series?
Yes → it's not a scale problem, it's a query or hardware problem. Run topk(10, count by (__name__)({__name__=~".+"})) and look for surprises. Profile the slow query.

2. Are you over 1M and over the budget you've set for ops effort?
Yes → managed (Grafana Cloud, Sutrace, ClickStack). The Grafana Cloud comparison covers the trade-offs.

3. Are you over 1M with engineering capacity to operate distributed systems?
Yes → Mimir. You'll be happy in two years; the first six months are painful.

4. Are you starting fresh and OTel-native already?
Skip Prometheus-as-storage. Use OTel Collector + ClickHouse, or a managed OTLP-native backend.

What sample_limit: 200 doesn't fix

It's worth being clear about what sample_limit does not solve:

  • Slow queries. A sum without(instance) over a million-series metric is slow regardless of sample limits. You need recording rules or a pre-aggregation layer.
  • Cardinality from RPCs. If customer_id is on every metric and you have 100k customers, no sample_limit saves you.
  • Long retention. Single-instance Prometheus is bad at long retention because of how the TSDB compacts. You want object storage offload (Thanos) or columnar storage (ClickHouse).
  • Cross-region queries. If you have one Prometheus per region, sample_limit doesn't help you query across them — that's a federation / Thanos / Mimir problem.
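For the slow-query bullet above, the standard fix is a recording rule that pre-aggregates at write time so dashboards query the cheap recorded series instead of the raw million-series one. A sketch, using http_requests_total as a hypothetical example metric:

```yaml
# Recording-rule fragment: aggregate away the instance label once,
# at evaluation time, instead of on every dashboard load.
groups:
  - name: pre-aggregation
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum without (instance) (rate(http_requests_total[5m]))
```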

sample_limit is a defence. It stops bad scrapes from making everything worse. It doesn't make a 5M-series cluster fast.

The cost of not setting limits

Two stories from real teams.

Story 1. A B2B SaaS adds a customer_id label to one widely-emitted metric. Their Prometheus has no sample_limit. Within an hour, scrapes are returning 50x more samples than the previous hour. The WAL bloats, and the next pod restart takes 40 minutes to replay. They lose 40 minutes of monitoring during the worst possible window — a deploy. Whether that deploy caused a regression is unknowable, because the data for those 40 minutes was never recorded.

Story 2. Same shape, different team. They had sample_limit: 500 on every scrape. The same customer_id PR went out. The next scrape was rejected. An alert fired. The on-call engineer reverted the PR within 12 minutes. No data loss. No invoice spike (this team was on Datadog and the PR would have cost them an estimated $3,200/month for the cardinality alone — see the cardinality cost-attribution post for the math).

The difference between these two stories is one line of YAML.

The HN megathread context

The HN $83K renewal thread, the What instead of Datadog thread, and the cheaper-Datadog thread all touch the same dynamic — teams hitting an observability cost cliff. The unspoken substrate is usually that the team's Prometheus has been allowed to grow unchecked, and either they switched to Datadog and got the cardinality bill, or they stayed on Prometheus and got the operational bill. sample_limit and its siblings are the cheapest insurance against either.

ClickHouse's roundup of OTel-compatible platforms frames the alternative — the day you graduate, OTLP-native is the move that doesn't have a third migration in five years.

What to do this week

Three concrete actions:

  1. Audit your scrape configs. Add sample_limit, label_limit, label_value_length_limit, and target_limit to every job. Pick numbers that allow your largest legitimate scrape with 50% headroom.
  2. Run topk(20, count by (__name__)({__name__=~".+"})). Find your top 20 metrics by series. For each, ask: do all those series matter? If not, drop labels at the relabel step.
  3. Decide your graduation point. Pick a number — 1M series, or 50% of your hardware capacity — that triggers the migration project. Write it down. Without a number, the migration starts the day everything is on fire.
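For action 2, dropping a label at the relabel step is a metric_relabel_configs entry. A sketch, using customer_id (the illustrative offender from the stories above) as the label to drop:

```yaml
# Sketch: strip a high-cardinality label at scrape time, before
# the series ever reach the TSDB.
scrape_configs:
  - job_name: 'application'
    sample_limit: 200
    metric_relabel_configs:
      - action: labeldrop
        regex: customer_id
    static_configs:
      - targets: ['app-1:9100', 'app-2:9100']
```

labeldrop removes the matching label from every scraped series on that job, collapsing the per-customer series back into one.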

For the architectural side of the answer — moving cardinality cost attribution upstream of the bill — the cardinality cost-attribution post is the companion piece.

Closing

Prometheus is a great tool that has a clear ceiling, and the ceiling is reachable on a normal year of growth. The defence is configuration discipline (the four limits) and a graduation plan (one of the four options) before the dashboard is slow, not after. The day you graduate, OTLP-native storage is the move that pays back across multiple migrations — see the Sutrace OTel backend page for the managed version and the OTel-Collector + ClickHouse quickstart for the self-host one.

The single most underrated line of YAML in the industry is sample_limit: 200. Use it.