Cardinality cost attribution, before the bill arrives

Why label sprawl is an architectural problem, how Datadog's pricing reacts to it, and what cost attribution before ingest looks like in practice.

By Akshay Sarode · January 22, 2026 · 11 min read
Tags: datadog, observability, cardinality, prometheus

TL;DR. Cardinality is the number of unique time-series your metrics produce, and in 2026 it is the dominant cost driver in commercial observability. Datadog charges per custom metric where each unique tag combination counts. The right architectural answer is not "use fewer tags" — it is to attribute cardinality cost to a service and team before the data hits storage. That means a per-service series budget, a sampler that enforces it, and a UI that shows engineers their impact in dollars before they ship the change. This post explains how we got here, why label sprawl is structural rather than a discipline problem, and what a budget-first observability layer looks like.

The framing — straight from ClickHouse

On 7 January 2026, Mike Shi closed ClickHouse's observability year-in-review with a line worth quoting in full: "high cardinality became a huge cost driver." He's not wrong, and the entire industry felt it through 2025. The HN megathreads — the $83K renewal, the $65M Coinbase bill, the cheaper-Datadog discussion — all rotate around the same gravity well. The bill grew. Headcount didn't. Where did the data go?

It went into the cardinality dimension. And the bill grew there because that's the dimension nobody's tooling shows them until invoice day.

What cardinality actually is

A time series is uniquely identified by its metric name and label set. Take a metric:

http_requests_total{service="checkout", route="/cart", method="POST", status="200", region="eu-west-3", customer_tier="enterprise"}

The cardinality of http_requests_total is the number of unique combinations of those labels. With 10 routes, 5 methods, 8 status codes, 4 regions, and 3 customer tiers, you have 4,800 series for one metric. Add customer_id and the upper bound becomes 4,800 × N customers — which can mean tens of millions.
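The multiplication above can be sketched directly. The label counts are the ones from the example; the customer count is an illustrative assumption.

```python
from math import prod

# Distinct values per label on http_requests_total (from the example above)
label_values = {
    "route": 10,
    "method": 5,
    "status": 8,
    "region": 4,
    "customer_tier": 3,
}

# Upper bound on series count: the product of per-label value counts
base_series = prod(label_values.values())
print(base_series)  # 4800

# Adding customer_id multiplies the bound by the number of customers
customers = 10_000  # assumed, for illustration
print(base_series * customers)  # 48000000 — tens of millions
```

The real series count is usually lower than the upper bound (not every combination occurs), but the bound is what you have to plan for.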

This isn't theoretical. Cloudflare's Prometheus-at-scale post walks through how a single misconfigured exporter can flood a TSDB. Their mitigation is a one-line config: sample_limit: 200. If a target produces more than 200 samples in a scrape, the scrape is rejected. Simple, brutal, effective.
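In Prometheus scrape configuration, that mitigation looks roughly like this — a minimal sketch where the job name and target are placeholders, not Cloudflare's actual config:

```yaml
scrape_configs:
  - job_name: checkout          # placeholder job name
    sample_limit: 200           # reject the scrape if a target exceeds 200 samples
    static_configs:
      - targets: ["checkout:9090"]
```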

Last9 and Sysdig both cite a soft ceiling of 1–2 million active series on a single Prometheus instance before query latency degrades sharply. Chronosphere's scaling guide puts the same number in different words. Above that, you're sharding, federating, or paying.

So cardinality is bounded by physics in self-hosted setups and by your wallet in commercial ones.

How Datadog's pricing reacts

Datadog defines a "custom metric" as a unique combination of metric name and tag values. Each host's plan allocation covers the first 100. Above that allocation, SigNoz's pricing teardown and OneUptime's analysis put the marginal cost at roughly $0.05 per custom metric per month, with volume-tier discounts.

That sounds reasonable until you do the multiplication. Add customer_id to one widely-emitted metric in a B2B SaaS with 500 customers and you've added 500 × (other label cardinality) series. At enterprise discount levels you're still adding four or five figures to your monthly invoice from one git commit.
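Back-of-envelope math for that scenario, using the roughly $0.05/series/month figure cited above. The per-customer label cardinality is an assumption for illustration; plug in your own.

```python
RATE_PER_SERIES_MONTH = 0.05  # rough list price cited above, pre-discount

customers = 500
other_label_cardinality = 60  # assumed: routes x methods x statuses per customer

added_series = customers * other_label_cardinality
added_cost_per_month = added_series * RATE_PER_SERIES_MONTH
print(added_series)          # 30000
print(added_cost_per_month)  # 1500.0 — four figures from one label
```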

The HN thread on the $83K renewal is full of engineers describing exactly this: a single PR, merged on a Tuesday, billed on the 1st. We did a line-by-line of that thread in the $83K Datadog renewal teardown.

Why "use fewer tags" is the wrong answer

Tags are the unit of analysis. If you remove customer_id, you can no longer answer "which customer is timing out on checkout?" The tag exists because someone needed it. Removing it is regressing the product's debuggability.

The architectural answer is not fewer tags — it's a different storage tier for high-cardinality dimensions. Logs and traces are designed for high cardinality. Metrics are designed for aggregation. The mistake is putting customer_id on a metric instead of relying on the trace.

But here's the actual problem: the tooling doesn't tell you that. A developer adds a tag, the dashboard works, the change ships, and 30 days later finance sees the spike. There's no feedback loop at write-time.

Cost attribution before ingest — the design

What if the developer saw, at the moment they pushed the metric label, that the change would add $1,400/month to the bill? They wouldn't ship it. They'd write the same dashboard query against traces or logs.

That's the entire premise of cost-attribution-before-ingest. It needs three pieces:

1. A cardinality budget per service. Not per metric — per service. Engineers think in services. The budget is a knob the team owns.

2. A sampler that enforces it. When a service exceeds its budget, the collector either rate-limits new series, drops the offending labels, or refuses the scrape entirely (Cloudflare's sample_limit: 200 model). The behavior is configurable; the default is to warn and rate-limit.

3. A UI that shows the cost in real time. Not as a pricing dashboard — as part of the metric's own page. "This metric currently produces 12,400 series. The marginal series cost on your plan is $X. Adding customer_id as a label would project to 940,000 series."

The first two are infrastructure. The third is product design. Most observability vendors ship the first, half-ship the second, and don't ship the third at all — because the third is incentive-misaligned with selling more cardinality.
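A minimal sketch of piece 2 — a per-service series budget enforced at the collector. Every name here is hypothetical; the point is the shape: existing series always pass, new series are refused once the budget is hit, matching the warn-and-rate-limit default described above.

```python
import logging

logger = logging.getLogger("cardinality-budget")

class SeriesBudget:
    """Tracks active series per service; refuses new series over budget."""

    def __init__(self, budgets: dict[str, int]):
        self.budgets = budgets                      # service -> max active series
        self.active: dict[str, set] = {}            # service -> seen label sets

    def admit(self, service: str, metric: str, labels: dict[str, str]) -> bool:
        key = frozenset({("__name__", metric), *labels.items()})
        seen = self.active.setdefault(service, set())
        if key in seen:
            return True                             # existing series always pass
        if len(seen) >= self.budgets.get(service, 10_000):
            logger.warning("service %s over series budget; dropping new series", service)
            return False                            # rate-limit: refuse new series only
        seen.add(key)
        return True

budget = SeriesBudget({"checkout": 2})
print(budget.admit("checkout", "http_requests_total", {"route": "/cart"}))  # True
print(budget.admit("checkout", "http_requests_total", {"route": "/pay"}))   # True
print(budget.admit("checkout", "http_requests_total", {"route": "/new"}))   # False
print(budget.admit("checkout", "http_requests_total", {"route": "/cart"}))  # True
```

A production version would also expire idle series and emit the warning to the team's on-call channel, but the admission logic is this small.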

The Cloudflare model in detail

Cloudflare's setup is worth dwelling on because it's the most honest pattern in the industry. Cloudflare runs Prometheus across thousands of nodes. Their scaling post describes a sample_limit per scrape and a recording-rule layer that pre-aggregates the dimensions humans actually query.

Two design choices matter. First: the limit is per scrape target, which means each service team's bad day doesn't take down the platform. Second: the pre-aggregation layer means dashboards query a low-cardinality view, not the raw stream. The raw stream is preserved (so you can drill in) but it's not the path of dashboard queries.
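The pre-aggregation layer can be sketched as a Prometheus recording rule — the rule and record names here are illustrative, not Cloudflare's actual configuration:

```yaml
groups:
  - name: pre-aggregation
    rules:
      # Dashboards query this low-cardinality view, not the raw stream
      - record: service:http_requests:rate5m
        expr: sum by (service, status) (rate(http_requests_total[5m]))
```

The recorded series keeps only `service` and `status`, so dashboard queries touch thousands of series instead of millions, while the raw stream remains available for drill-down.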

This is the pattern any honest observability product should adopt. We adopted it. We talked through the budget shape in the 50M time-series budget post.

Where Datadog could fix this

They could ship a "metric cost preview" in the UI before a tag is committed. They could expose per-service cardinality budgets. They could refund cardinality-driven overages on first occurrence. They don't, because every per-tag dollar is revenue, and the procurement cycle hasn't pushed back hard enough yet.

The HN threads suggest it's pushing back now. The cheaper-Datadog thread from January and the What instead of Datadog thread are full of teams running the math and choosing differently. The market is voting.

What this looks like in Sutrace

We track cardinality per service per metric, in real time. The UI shows the active series count, the trend, and the projected monthly cost on your plan. When a deploy adds a new label, the metric's page shows the delta — the series the label added, the projected cost change, and a one-click revert path that drops the label at the collector before it hits storage.

If a service exceeds its budget, the default is warn-and-rate-limit. The team's on-call gets a notification with a link to the metric, the offending labels, and the option to: (a) raise the budget, (b) move the label to logs/traces, or (c) accept the cap.

There's no separate "cost dashboard." Cost is attribution data on the metric itself. That's the product design difference.

We don't bill on cardinality. We bill on ingest you can predict. Cardinality is a thing we monitor, not a thing we charge for. The full pricing model is on the pricing page. The architectural piece is in Sutrace as a Datadog alternative.

The honest counter-argument

You could say: "If you don't bill on cardinality, you'll go bust the first time a customer ships a million-series metric." Fair. The answer is the budget. We will rate-limit, warn, and ultimately refuse to accept series above what your plan supports. The customer doesn't see a surprise invoice; they see a yellow banner and a config knob. Storage cost is bounded by enforcement, not by hoping the customer won't notice.

This is closer to how AWS bills you for EC2: there's a limit on what you can spin up, you can request more, and you can't accidentally spawn 10,000 instances. Observability tooling has been an exception — you can accidentally spawn 10,000,000 series — and that exception is the bug, not the feature.

What to do this week

  1. Run count by (__name__) ({__name__=~".+"}) against Prometheus, or use the Metrics Summary page in Datadog. Find the top 10 metrics by series count.
  2. For each, identify the labels driving cardinality. The usual suspects: customer_id, request_id, instance with auto-scaling churn, pod with high deploy frequency.
  3. For each high-cardinality label, ask: "Could I answer the same question from a trace or log?" If yes, demote the label.
  4. Set a per-service budget. Even a rough one. The act of writing it down changes how engineers think about the next PR.
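Step 2 can be approximated offline: given a dump of a metric's label sets, count distinct values per label to see which labels drive the fan-out. Everything below is illustrative.

```python
from collections import defaultdict

def label_drivers(series: list[dict[str, str]]) -> dict[str, int]:
    """Distinct value count per label name across one metric's series."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    # Labels with the most distinct values drive cardinality
    return dict(sorted(((k, len(v)) for k, v in values.items()),
                       key=lambda kv: -kv[1]))

series = [
    {"route": "/cart", "customer_id": "c1"},
    {"route": "/cart", "customer_id": "c2"},
    {"route": "/pay",  "customer_id": "c3"},
]
print(label_drivers(series))  # {'customer_id': 3, 'route': 2}
```

In this toy dump, customer_id is the driver — the label to demote to traces or logs per step 3.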

The full migration version of this is in migrating from Datadog to OTel — the week-one checklist.

Closing

Cardinality became the cost driver because the industry's pricing model rewarded label sprawl with surprise charges instead of warnings. The fix is upstream: budget, enforce, attribute. Show engineers the cost at the moment they're making the choice. Then bill on what the customer can predict.

If you want to see your own metrics' cardinality without committing to anything, point an OTel collector at our OTLP endpoint with the free tier and look at the cardinality view. It's the first thing we built and it's the one report most teams find embarrassing.