Log volume, cardinality, and the 50M time-series budget
A technical deep-dive on the architectural ceiling of single-instance Prometheus, why Cloudflare's sample_limit: 200 matters, and how to design a 50M time-series budget.
TL;DR. A single-instance Prometheus tops out at 1–2 million active time-series before query latency degrades to unusable. Past that you're sharding, federating, or paying. The right design pattern is Cloudflare's: a per-scrape sample_limit: 200 guardrail at the collector, recording-rule pre-aggregation for dashboard queries, and a hard budget per service. This post walks through the architectural ceiling, the math behind a 50M-series budget across a fleet, and the design choices that let you operate at that scale without blindsiding your finance team. We end with the practical config that's been working in production for the last 18 months.
This is the technical companion to the cardinality cost-attribution post. That post is about the product design; this one is about the bytes.
The ceiling, with sources
Four independent sources put the single-instance Prometheus ceiling in the same range:
- Cloudflare's at-scale post — production experience operating thousands of Prometheus instances. They cite per-instance budgets and the sample_limit mitigation in detail.
- Last9's challenges-with-Prometheus post — explicit 1–2M active-series number, with degradation modes (memory pressure, query timeouts, WAL replay times).
- Sysdig's Prometheus-at-scale piece — same number, different lens.
- Chronosphere's scaling guide — confirms the soft ceiling and lists the standard sharding approaches (federation, Thanos, Cortex, Mimir).
The number is consistent because the bottleneck is consistent: Prometheus keeps active series in memory, and the index lookups on high-cardinality data become CPU-bound. You can push past 2M with a fat instance and careful tuning, but the marginal cost of each additional million series rises sharply.
What "active series" actually means
A series is uniquely identified by (metric_name, label_set). "Active" means it received a sample recently enough to still live in the TSDB head block — typically the last two to three hours. The TSDB indexes these in memory.
If your service emits http_requests_total with labels route, method, status, you have route × method × status potential series. Whether they're all active depends on whether all combinations actually occur in production. A 404 on POST /api/admin/delete-everything that's never been hit doesn't count.
This is why cardinality estimation is hard: the upper bound is a multiplication, but the actual count is data-dependent. A query like count by (__name__) ({__name__=~".+"}) gives you the live per-metric numbers. Run it. Most teams find 30–50% of the upper bound is dead, and 5–10% of metrics account for 80% of the cardinality.
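The gap between upper bound and live count is easy to see with concrete numbers. A quick sketch of the arithmetic — the label value counts and the 40% liveness figure here are made-up illustrations, not measurements:

```python
# Upper-bound cardinality is the product of label value counts;
# the live count is whatever combinations actually occur.
routes, methods, statuses = 40, 5, 12   # hypothetical label value counts

upper_bound = routes * methods * statuses
print(upper_bound)       # 2400 potential series for one metric

# If only ~40% of combinations ever fire in production,
# the active-series count is far lower than the bound:
active_estimate = int(upper_bound * 0.4)
print(active_estimate)   # 960
```

The bound is what you pay for in risk; the live count is what you pay for in dollars. Budget against the bound, bill against the count.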
The Cloudflare pattern, in full
The Cloudflare post describes a five-layer defense:
1. sample_limit per scrape target.
```yaml
scrape_configs:
  - job_name: 'app'
    sample_limit: 200
    static_configs:
      - targets: ['app:9090']
```
If a scrape returns more than 200 samples, the entire scrape is rejected. This means one bad service can't poison the TSDB. The cost: you'll occasionally lose data from a service that legitimately needs more than 200 samples — which is the warning signal that you need to refactor that service's metrics.
2. Per-metric label limits. A Prometheus relabeling rule can drop labels above a threshold. Less commonly used because relabeling rules are runtime-expensive, but they're in the toolkit.
3. Recording rules for dashboards. Pre-aggregate the dimensions humans actually query.
```yaml
groups:
  - name: dashboard_aggs
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
      - record: service_route:http_requests:rate5m
        expr: sum by (service, route) (rate(http_requests_total[5m]))
```
Dashboards query the recording rules, not the raw metrics. The raw metrics are kept (you can drill in) but they're not on the hot path. This is the single most important practice and the one most teams skip.
4. Federation for cross-instance queries. If you need to query across shards, federation aggregates pre-computed sums upward. Don't federate raw series; federate recording rules.
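A minimal federation scrape config in this spirit might look like the following — the shard hostnames are placeholders, and the match expression assumes your recording rules follow the service: naming prefix used above:

```yaml
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"service:.*"}'   # pull only recording rules, never raw series
    static_configs:
      - targets: ['shard-1:9090', 'shard-2:9090', 'shard-3:9090']
```

The match[] selector is the whole trick: a global Prometheus that federates raw series inherits the full cardinality of every shard, which defeats the sharding.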
5. Series budget per service. Document how many series each service is "allowed." Review at incident post-mortems. Treat overruns as architectural debt.
A 50M time-series budget, sized
Suppose you operate 50 services with a target of 50 million active series across the fleet. Math:
- 50 services
- Average budget: 1M series each
- Tail allowance: 3 services at 5M (the high-cardinality ones: payment, search, recommender), which leaves the other 47 services ~745K each
- Sharding: 3 Prometheus shards at ~17M each (over the soft ceiling, requires Thanos/Mimir)
Or, more conservatively:
- 50 services × 1M = 50M
- 25 shards × 2M each (each shard within single-instance ceiling)
- Federation rolls up dashboard queries
The second is operationally simpler — you avoid the Thanos/Mimir tax. The downside is more shards to operate. Most teams pick the second once they're past 20M.
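Both layouts reduce to simple arithmetic. A sketch, using the 2M soft ceiling cited earlier:

```python
import math

TOTAL_SERIES = 50_000_000
SOFT_CEILING = 2_000_000   # single-instance soft ceiling

# Layout 1: few fat shards (over the ceiling, needs Thanos/Mimir)
fat_shards = 3
per_fat_shard = TOTAL_SERIES // fat_shards
print(per_fat_shard)       # 16_666_666 -> ~17M per shard

# Layout 2: many shards, each within the single-instance ceiling
shards_needed = math.ceil(TOTAL_SERIES / SOFT_CEILING)
print(shards_needed)       # 25 shards at 2M each
```

The ceiling is the design constant: every other number in the budget falls out of dividing by it.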
Where logs fit
Logs are different. High cardinality is fine in logs because the storage model is different: logs are unstructured text indexed on a few canonical fields, not a label set. You can put customer_id in a log line and it costs you ingest + storage, but it doesn't multiply against other fields.
This is the architectural answer to "how do I keep customer_id debuggability without blowing up metrics?" — log it, don't tag the metric. The metric stays low-cardinality (good for aggregation, dashboards, alerts), the log carries the high-cardinality dimension (good for debugging, search, drill-in).
The trap: most teams put high-cardinality dimensions on metrics out of habit, then watch the bill explode. The fix is a discipline shift — when adding a label, ask "is this for a dashboard or for a debugging session?" If it's the latter, it belongs in logs or traces.
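A sketch of what the split looks like in application code — the in-process counter and JSON log line here are illustrative, not a specific SDK:

```python
import json
import time
from collections import Counter

# Metric side: low-cardinality labels only (bounded value sets).
request_counter = Counter()

def record_request(route: str, status: int, customer_id: str) -> None:
    status_class = f"{status // 100}xx"          # 2xx/4xx/5xx, not every code
    request_counter[(route, status_class)] += 1  # no customer_id on the metric

    # Log side: the high-cardinality dimension goes here instead.
    # It costs ingest + storage but never multiplies against other fields.
    print(json.dumps({
        "ts": time.time(),
        "route": route,
        "status": status,
        "customer_id": customer_id,              # searchable, not indexed as a label
    }))

record_request("/api/orders", 200, "cust-8841")
```

The dashboard reads the counter; the debugging session greps the log. Same event, two destinations, one cardinality bill avoided.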
What 50M series costs
Self-hosted, on commodity hardware:
- 25 Prometheus shards, each on a 32GB node = $50–80/node/month on cloud = $1,250–$2,000/mo
- Long-term storage (Thanos object store) ~$200/mo at this scale
- Operator time: 0.25 FTE (one engineer's part-time attention)
Total: roughly $2,000/mo + 0.25 FTE = ~$8,000/mo all-in.
Commercial Prometheus-compatible (Grafana Cloud, Chronosphere, Last9) at this scale:
- Per-series pricing varies by vendor; ballpark $0.001–$0.005 per active series per month
- 50M series × $0.002 = $100,000/month
Datadog at this scale:
- Most series would land as custom metrics. At ~$0.05 per custom metric per month with enterprise discount of 50%, you're looking at $1.25M/month. This is the regime where companies show up in the Pragmatic Engineer's $65M Coinbase post.
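The three figures above are straightforward arithmetic. A sketch — the rates are the ballpark numbers from this post, not vendor quotes:

```python
SERIES = 50_000_000

# Self-hosted: 25 shards at the upper-end node cost, plus object storage
self_hosted = 25 * 80 + 200
print(self_hosted)   # 2200 -> ~$2K/mo before operator time

# Prometheus-compatible vendor at ~$0.002 per active series per month
vendor = round(SERIES * 0.002)
print(vendor)        # 100000 -> $100K/mo

# Datadog custom metrics at $0.05/series with a 50% enterprise discount
datadog = round(SERIES * 0.05 * 0.5)
print(datadog)       # 1250000 -> $1.25M/mo
```

Three orders of magnitude between the first and last line, for the same 50M series.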
The order-of-magnitude gap is what drives migration decisions. We covered the migration playbook in the OTel migration checklist.
The Sutrace approach
We bill on ingest (predictable) with cardinality monitored but not multiplied. The architectural backstops are:
1. sample_limit-style per-source rate limits at the collector. Default 1,000, configurable up.
2. Per-service series budgets, surfaced in the UI. Default 100K series per service; raise on request.
3. A recording-rule layer pre-computed for dashboards. Customers can add custom recording rules.
4. Cardinality attribution per metric, with a "what would happen if I added this label" preview.
The combination prevents the runaway scrape from eating the TSDB and prevents the surprise renewal. We're not the only ones doing this — SigNoz and Chronosphere have versions of (1)–(3). We're focused on (4) because the feedback loop at write-time is what's missing in most products.
The full pricing model is on the pricing page. The architectural piece is in Sutrace as a Datadog alternative.
The practical config
If you take one thing from this post, take this collector config. It's the minimum viable cardinality-safe Prometheus setup:
```yaml
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'app'
    sample_limit: 200              # Cloudflare guardrail
    label_limit: 30                # max labels per metric
    label_name_length_limit: 200
    label_value_length_limit: 200
    static_configs:
      - targets: ['app:9090']

rule_files:
  - 'recording_rules.yaml'

remote_write:
  - url: 'https://ingest.example.com/v1/write'
    queue_config:
      capacity: 10000
      max_samples_per_send: 1000
```
And recording_rules.yaml:
```yaml
groups:
  - name: dashboard_aggregations
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
      - record: service_status:http_requests:rate5m
        expr: sum by (service, status_class) (rate(http_requests_total[5m]))
      - record: service:http_duration_p95
        expr: histogram_quantile(0.95, sum by (service, le) (rate(http_duration_bucket[5m])))
```
Note status_class, not status — you bucket into 2xx vs 4xx vs 5xx classes, you don't keep every status code. This is the recording-rule pattern: aggregate to the dimension humans actually query.
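One way to derive status_class without touching application code is a metric relabel rule at scrape time. A sketch — it assumes the raw metric carries a status label with the full three-digit code:

```yaml
scrape_configs:
  - job_name: 'app'
    metric_relabel_configs:
      - source_labels: [status]
        regex: '(\d)..'            # 200 -> 2xx, 404 -> 4xx, 503 -> 5xx
        target_label: status_class
        replacement: '${1}xx'
      - regex: 'status'            # optionally drop the raw code entirely,
        action: labeldrop          # collapsing series within each class
```

The labeldrop is the aggressive variant: it trades the ability to distinguish a 502 from a 504 in metrics for a hard cardinality reduction. If you need the exact code, it belongs in logs, per the section above.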
What to monitor about your monitoring
This is the meta-layer most teams skip. You should have alerts on:
- Active series count per Prometheus, with thresholds at 80% and 95% of the soft ceiling.
- Scrape failures (rejected by sample_limit) per target. Spikes here mean a service grew its cardinality.
- Recording-rule evaluation latency. Slow rules indicate cardinality on the wrong dimension.
- Remote-write queue depth. Backpressure means you're producing faster than you can store.
Without these, you only find out you're past the ceiling when the dashboards stop loading.
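All four alerts can be expressed against Prometheus's own self-metrics. A sketch — the thresholds are illustrative, and the 2e6 constant is the soft ceiling discussed above:

```yaml
groups:
  - name: meta_monitoring
    rules:
      - alert: ActiveSeriesNearCeiling
        expr: prometheus_tsdb_head_series > 0.8 * 2e6
        for: 15m
      - alert: ScrapesRejectedBySampleLimit
        expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[1h]) > 0
      - alert: SlowRecordingRules
        expr: prometheus_rule_group_last_duration_seconds > 10
      - alert: RemoteWriteBackpressure
        expr: prometheus_remote_storage_samples_pending > 100000
```

Add a second ActiveSeriesNearCeiling rule at 95% with a page instead of a ticket; the 80% alert is the one that gives you time to act.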
The honest counter-argument
You could say: "This is a lot of operational overhead for what should be a managed service." Fair. The answer depends on your scale.
Below 5M series total: don't bother. Use a managed service. The economics aren't there for self-host.
5M–50M series: it depends on whether you have 0.25 FTE to operate. The cost gap is real but so is the on-call burden.
Above 50M series: self-host or use a vendor that's priced on ingest, not cardinality. The Datadog model becomes prohibitive past this point.
This is the regime where Sutrace's ingest-based pricing actually saves money. Below 5M series the savings are nice-to-have; above 50M they're business-critical.
Closing
The 50M time-series budget isn't a goal. It's a recognition that the upper bound of useful metrics in a real fleet is bounded by physics (TSDB performance) and economics (per-series billing). The teams that operate well at this scale do four things: they enforce per-scrape sample limits, they pre-aggregate via recording rules, they budget per service, and they put high-cardinality dimensions in logs rather than metrics.
If your bill or your TSDB latency is suggesting you've blown past the soft ceiling, the audit is the first move. The cardinality post has the audit script. Sutrace pricing is public if you want to see what an ingest-based alternative looks like.
The architectural lesson is older than this post: bound your inputs at the edge, aggregate before you query, attribute before you bill. The Prometheus ecosystem learned this; the SaaS pricing models haven't caught up yet. They will.