---
title: Cardinality, explained with examples your finance team will understand
description: What cardinality actually is, why high-cardinality labels break Prometheus and inflate Datadog bills, and the concrete arithmetic for HTTP method × status × path × user_id × region.
author: Akshay Sarode
published: 2025-05-21
updated: 2025-05-21
cluster: c5-otel
tags: [opentelemetry, prometheus, cardinality, datadog, observability]
reading: 10 min
hero: A multiplication tree with one branch shaded red.
---

# Cardinality, explained with examples your finance team will understand

**TL;DR.** Cardinality is the number of unique time-series your metrics produce. Each unique combination of labels on a metric is one series. Series count multiplies — `method × status × path × region × user_id` is a product, not a sum. Most cost surprises in observability come from a developer adding one label that multiplies the existing cardinality by a factor of 100 or 10,000. This post explains cardinality plainly, walks through five worked examples (small / medium / large / catastrophic / "what we actually shipped"), explains why it breaks both Prometheus (operationally) and Datadog (financially), and gives you a 5-minute audit you can run today. Aimed at engineers who want to talk to finance without slides, and finance who want to argue with engineers without slides.

## The plain-English version

A *time-series* is a stream of measurements over time, identified by a name and a set of labels.

A metric like `http_requests_total` is one *name*. The series under it are the unique combinations of labels:

```
http_requests_total{service="checkout", method="POST", status="200"}
http_requests_total{service="checkout", method="POST", status="500"}
http_requests_total{service="checkout", method="GET",  status="200"}
http_requests_total{service="login",    method="POST", status="200"}
... and so on
```

Each of those is a separate stream of numbers, stored and indexed on its own. Queries across them are fast when the count is small and slow when it is large, and the bill grows the same way.

*Cardinality* is just the number of unique series. If you have 4 services, 5 methods, and 8 status codes, you have at most 4 × 5 × 8 = 160 series for that metric. The keyword is *at most* — in practice, not all combinations exist. But the upper bound is the multiplication.
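A minimal sketch of that upper bound in Python. The label values are illustrative placeholders (the post only fixes the counts: 4 services, 5 methods, 8 status codes); the point is that the product of the counts and a full enumeration of combinations give the same number.

```python
from itertools import product
from math import prod

# Illustrative label values; only the counts (4, 5, 8) come from the text.
labels = {
    "service": ["checkout", "login", "search", "billing"],
    "method": ["GET", "POST", "PUT", "DELETE", "PATCH"],
    "status": ["200", "201", "204", "301", "400", "404", "500", "503"],
}

# Upper bound: multiply the number of values each label can take.
upper_bound = prod(len(values) for values in labels.values())

# Enumerating every possible label combination gives the same count.
all_combinations = list(product(*labels.values()))

print(upper_bound)            # 160
print(len(all_combinations))  # 160
```

The real series count is whatever subset of those 160 combinations your services actually emit.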

## Why it matters in two sentences

For self-hosted Prometheus: query latency degrades sharply once an instance holds around 1–2M active series. See [Last9](https://last9.io/blog/challenges-with-running-prometheus-at-scale/), [Sysdig](https://www.sysdig.com/blog/challenges-scale-prometheus), and [Chronosphere](https://chronosphere.io/learn/how-to-address-prometheus-scaling-challenges/); they all converge on the same range.

For Datadog and other commercial backends: high cardinality is what you bill for. Datadog defines a "custom metric" as a unique combination of metric name + tag values. [SigNoz's pricing teardown](https://signoz.io/blog/datadog-pricing/) walks through how a single PR can multiply your invoice.

That's the entire problem in two sentences. The rest of this post is examples.

## Five worked examples

### Example 1 — Small (20 series)

A metric `db_query_duration_seconds` with two labels:

- `database` ∈ {`primary`, `replica`}  → 2 values
- `query_type` ∈ {`select`, `insert`, `update`, `delete`, `ddl`}  → 5 values

Maximum cardinality: 2 × 5 = **10 series.**

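Prometheus also attaches target labels (`instance`, `job`) at scrape time, and those count too. If the app runs as 2 scraped instances, every combination exists once per instance: 10 × 2 = **20 series.**

Cost on Datadog, assuming roughly $0.05 per series per month (the rate used for every example below): 20 × $0.05 = **$1/month.** Trivial.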

Operational cost on Prometheus: invisible. You will never notice this metric.

### Example 2 — Medium (3,840 series)

A metric `http_requests_total` with the labels you'd write on day one of any web service:

- `method` ∈ {`GET`, `POST`, `PUT`, `DELETE`, `PATCH`}  → 5 values
- `status` ∈ ~20 distinct codes seen in production (the actual codes, not the 2xx/3xx/4xx/5xx classes)  → 20 values
- `path` (templated) ∈ ~32 routes  → 32 values
- `region` ∈ {`eu-west-3`, `eu-central-1`, `us-east-1`}  → 3 values
- `instance` (Prometheus auto-label, scrape target) ∈ ~4 pods × scaling → average ~4 values

Maximum cardinality: 5 × 20 × 32 × 3 × 4 = **38,400 series.** In practice, not all combinations exist (you don't have a `DELETE /healthz` returning 503), so the realistic count is closer to **3,840 series** — about 10% of the upper bound.

Cost on Datadog: 3,840 × $0.05 = **$192/month.** Noticeable but fine.

Operational cost on Prometheus: a healthy single metric. You'd have ~50 of these and still be at 200k series total — well under the 1–2M ceiling.
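The arithmetic for this example can be wrapped in a small estimator. This is a sketch under the post's own assumptions: a 10% sparsity factor and $0.05 per series per month. The function name `estimate` and both constants are hypothetical, not anything a vendor publishes as an API.

```python
from math import prod

PRICE_PER_SERIES_USD = 0.05  # assumption: the per-series rate used in this post
SPARSITY = 0.10              # assumption: share of combinations that really occur


def estimate(label_sizes, sparsity=SPARSITY, price=PRICE_PER_SERIES_USD):
    """Return (upper bound, realistic series, monthly cost in USD)."""
    upper = prod(label_sizes.values())
    realistic = round(upper * sparsity)
    monthly = round(realistic * price, 2)
    return upper, realistic, monthly


upper, realistic, monthly = estimate(
    {"method": 5, "status": 20, "path": 32, "region": 3, "instance": 4}
)
print(f"{upper:,} max, {realistic:,} realistic, ${monthly:,.2f}/month")
# 38,400 max, 3,840 realistic, $192.00/month
```

Swap in your own label counts and measured sparsity; the 10% figure is a rule of thumb, not a constant.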

### Example 3 — Large (15,360 series)

Same as Example 2, but a developer adds `customer_tier` to track high-tier customer reliability separately:

- `customer_tier` ∈ {`free`, `pro`, `enterprise`, `internal`}  → 4 values

New maximum: 5 × 20 × 32 × 3 × 4 × 4 = 153,600. Realistic: ~**15,360 series** for this one metric.

Cost on Datadog: 15,360 × $0.05 = **$768/month.** Hmm.

Operational cost on Prometheus: still fine for one metric. The total budget across all your metrics is the question.

### Example 4 — Catastrophic (50M+ series)

Same as Example 3, but the developer wants per-customer reliability tracking and adds `customer_id`:

- `customer_id` for a B2B SaaS with 5,000 active customers  → 5,000 values

New maximum: 5 × 20 × 32 × 3 × 4 × 4 × 5,000 = 768,000,000 series. Realistic (sparsity discount): probably **76,800,000 series.**

Cost on Datadog: 76,800,000 × $0.05 = **$3,840,000/month.** No, that's not a typo.

Cost on Datadog with realistic enterprise discount and the assumption that not all customers hit all paths in a billing period: probably "only" $40k–$80k/month from one PR. Still catastrophic. This is the shape of the [HN $83K Datadog renewal thread](https://news.ycombinator.com/item?id=41357726).

Operational cost on Prometheus: completely impossible. You'd OOM the Prometheus pod within an hour. [Cloudflare's `sample_limit: 200`](https://blog.cloudflare.com/how-cloudflare-runs-prometheus-at-scale/) was designed exactly for this kind of explosion — the scrape would be rejected immediately. We covered the defence pattern in [Prometheus at scale, the Cloudflare 200 rule](/blog/prometheus-at-scale-cloudflare-200-rule).
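To see how Examples 2 through 4 escalate, here is a short sketch that tracks the running upper bound as each label is added. The label counts are the ones used in the examples; nothing here talks to a real backend.

```python
from itertools import accumulate
from operator import mul

# Labels in the order the examples introduce them, with their value counts.
labels = [
    ("method", 5),
    ("status", 20),
    ("path", 32),
    ("region", 3),
    ("instance", 4),
    ("customer_tier", 4),   # added in Example 3
    ("customer_id", 5000),  # added in Example 4
]

# Running product: the maximum series count after each label is added.
running = list(accumulate((count for _, count in labels), mul))

for (name, count), total in zip(labels, running):
    print(f"+ {name:<14} x{count:<5} -> {total:>13,} max series")

# running == [5, 100, 3200, 9600, 38400, 153600, 768000000]
```

The last two rows are the whole story: one label takes the bound from 38,400 to 153,600, the next takes it to 768,000,000.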

### Example 5 — What we actually shipped (a real anonymised case)

A team we worked with had `http_requests_total` with the standard labels plus a `request_id` label (someone wanted to "find slow requests"). `request_id` is a UUID — every request has a unique value.

Cardinality on `request_id` alone: roughly **1 series per request, ever, until restart.** Over 24 hours of normal traffic at 50 req/s, that's 4.3M unique series. Per metric. Per day.
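A toy simulation makes the effect concrete. It models a metric as a counter keyed by its label set, which is a simplification of how a real client library stores series; the `record` helper is hypothetical.

```python
import uuid
from collections import Counter


def record(store, **labels):
    """Increment the series identified by this exact label combination."""
    key = tuple(sorted(labels.items()))
    store[key] += 1


bounded = Counter()    # method + status only
unbounded = Counter()  # same labels plus a per-request UUID

for i in range(1000):
    status = "500" if i % 50 == 0 else "200"
    record(bounded, method="GET", status=status)
    record(unbounded, method="GET", status=status, request_id=str(uuid.uuid4()))

print(len(bounded))    # 2
print(len(unbounded))  # 1000

# The same growth at the traffic level quoted above: 50 req/s for 24 hours.
series_per_day = 50 * 24 * 60 * 60
print(f"{series_per_day:,}")  # 4,320,000
```

With bounded labels the series count stops growing after the first few requests. With `request_id` it grows by one on every request.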

The Prometheus pod OOM'd 90 minutes after deploy. The team rolled back. Total damage: 4 hours of degraded monitoring, two engineers' afternoon. No invoice impact (self-hosted) but full operational tax.

The lesson: there are some labels that should *never* go on a metric. UUIDs, timestamps, full URLs, IP addresses, free-text user input. They belong on traces or logs, where the storage is built for high cardinality. We covered the architectural answer in [cardinality cost attribution before the bill arrives](/blog/cardinality-cost-attribution-before-the-bill).
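One way to catch these before they ship is a value-level screen in code review or CI. The sketch below is a heuristic, not a standard tool: the function name `looks_unbounded`, the 64-character free-text threshold, and the timestamp patterns are all assumptions chosen for illustration.

```python
import ipaddress
import re
import uuid
from typing import Optional

# Assumed patterns: epoch seconds/millis, or an ISO-8601-style date prefix.
_TIMESTAMP = re.compile(r"^(\d{10,13}|\d{4}-\d{2}-\d{2}[T ].*)$")


def looks_unbounded(value: str) -> Optional[str]:
    """Return a reason if a label value looks like it belongs on a trace or log."""
    try:
        uuid.UUID(value)
        return "uuid"
    except ValueError:
        pass
    try:
        ipaddress.ip_address(value)
        return "ip address"
    except ValueError:
        pass
    if "://" in value or "?" in value:
        return "full url"
    if _TIMESTAMP.match(value):
        return "timestamp"
    # Assumed threshold: whitespace or very long values suggest free text.
    if any(ch.isspace() for ch in value) or len(value) > 64:
        return "free text"
    return None


for sample in ["POST", "eu-west-3", "10.0.0.1",
               "9f1c2e4a-7b3d-4c8e-9a1f-2b3c4d5e6f70",
               "https://example.com/cart?item=42", "2025-05-21T10:00:00Z"]:
    print(f"{sample!r}: {looks_unbounded(sample)}")
```

A screen like this will have false positives (a dotted version string can parse as an IP address), so treat a hit as a prompt for a human decision rather than a hard block.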

> [!NOTE]
> Diagram: Five panels, one per example. Series counts on a log scale. The fifth panel is a flat line that goes vertical.

## Why high cardinality breaks each backend differently

**Prometheus.** Each series occupies memory while it's actively receiving samples. Prometheus's TSDB indexes labels for fast lookup. High cardinality bloats the index, lengthens WAL replay times, and slows queries because more series means more candidate matches. The 1–2M ceiling cited by [Last9](https://last9.io/blog/challenges-with-running-prometheus-at-scale/), [Sysdig](https://www.sysdig.com/blog/challenges-scale-prometheus), and [Chronosphere](https://chronosphere.io/learn/how-to-address-prometheus-scaling-challenges/) is a soft physics ceiling, not a config knob.

**Datadog.** Custom metric pricing is per unique series per month. Under that model every label addition is billable, and the surprise lands 30 days later as an invoice line item that nobody can attribute to a specific PR without forensic effort. [SigNoz's Datadog teardown](https://signoz.io/blog/datadog-pricing/) and [OneUptime's pricing analysis](https://oneuptime.com/blog/post/2026-03-13-how-datadog-pricing-actually-works/view) document the multipliers.

**ClickHouse / OTLP-native backends.** Storage handles high cardinality gracefully because columnar compression handles repetitive labels well. The problem moves to *query* — `ORDER BY` columns with high cardinality slow scans. The architectural answer is to keep high-cardinality dimensions in attribute maps (not in the sort key) and to use materialised views for common aggregations. We covered the schema patterns in [the OTel + ClickHouse quickstart](/blog/otel-collector-clickhouse-quickstart).

**Loki / log storage.** Logs are inherently high-cardinality and storage is designed for it. Adding `request_id` to a log is fine. Adding `request_id` as a metric label is not.

This is the punchline most teams miss: *the right answer to "I want per-customer debugging" is traces and logs, not metric labels.*

## The 5-minute audit you can run today

If you have Prometheus or any PromQL endpoint:

```
topk(20, count by (__name__)({__name__=~".+"}))
```

Returns the top 20 metric names by series count. Look for surprises.

For each surprise, drill down with:

```
count(count by (label_X) ({__name__="surprising_metric"}))
```

Replace `label_X` with each label name, one at a time. The query returns the number of distinct values that label takes on the metric. The label with the most distinct values is the one contributing the cardinality.
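The same drill-down can be done offline if you export the label sets for a metric, for example from the Prometheus HTTP API's `/api/v1/series` endpoint. The sketch below assumes you already have them as a list of dicts; `distinct_values_per_label` is a hypothetical helper and the sample data is synthetic.

```python
from collections import defaultdict


def distinct_values_per_label(series):
    """Rank labels by how many distinct values they take across the series."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    # Highest distinct-value count first; ties broken by label name.
    return sorted(
        ((name, len(seen)) for name, seen in values.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Synthetic label sets: 2 methods x 2 statuses x 50 customers = 200 series.
series = [
    {"method": method, "status": status, "customer_id": f"c{customer}"}
    for method in ("GET", "POST")
    for status in ("200", "500")
    for customer in range(50)
]

ranking = distinct_values_per_label(series)
print(len(series))  # 200
print(ranking)      # [('customer_id', 50), ('method', 2), ('status', 2)]
```

The label at the top of the ranking is the first candidate to move off the metric.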

For Datadog users: the [Metrics without Limits](https://docs.datadoghq.com/metrics/metrics-without-limits/) feature exposes per-metric cardinality. The "Cardinality" column on the metrics summary page is the same data.

## When a label is *worth* the cardinality

Not every label is a cardinality problem. Some are essential:

- **`environment`** with values like `prod`, `staging`, `dev` — low cardinality, high value.
- **`service`** — your services are bounded; useful.
- **`region`** — bounded; useful.
- **`status_code`** — bounded; useful.

These have *small N* and answer questions you ask every day. They're cheap and worth the spend.

The labels that bite are the ones with unbounded or very large N (`request_id`, `trace_id`, `customer_id`, `pod_name` in churning autoscalers) or the ones whose N grows with traffic (un-templated `url`, `user_agent`, `ip`).
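For the un-templated `url` case, the usual fix is to collapse variable path segments before the value becomes a label. The following is a rough sketch, assuming IDs appear as all-digit or UUID-shaped segments; `template_path` and the `:id` / `:uuid` placeholders are hypothetical. If your web framework exposes the matched route template, prefer that over pattern matching.

```python
import re

_UUID_SEGMENT = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def template_path(url_path: str) -> str:
    """Drop the query string and replace ID-like segments with placeholders."""
    path = url_path.split("?", 1)[0]
    parts = []
    for segment in path.split("/"):
        if segment.isdigit():
            parts.append(":id")
        elif _UUID_SEGMENT.match(segment):
            parts.append(":uuid")
        else:
            parts.append(segment)
    return "/".join(parts)


print(template_path("/orders/123/items/9?page=2"))  # /orders/:id/items/:id

# 1,000 distinct raw paths collapse to a single label value.
raw_paths = [f"/orders/{n}/items/{n + 1}" for n in range(1000)]
templated = {template_path(p) for p in raw_paths}
print(len(set(raw_paths)), len(templated))  # 1000 1
```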

## The architectural answer

The right architectural fix isn't "use fewer labels." It's:

1. **Three storage tiers.** Metrics for low-cardinality aggregates. Traces for per-request detail with high cardinality. Logs for unstructured detail with arbitrary cardinality.
2. **A budget per service.** Series count per service, enforced at the collector. When the budget is exceeded, warn loudly. Cloudflare's `sample_limit: 200`, applied per scrape target, is the canonical version.
3. **Cost attribution at write-time.** The developer should see, at the moment they push a label change, what it'll cost them. We unpacked the design in [cardinality cost attribution before the bill arrives](/blog/cardinality-cost-attribution-before-the-bill).

These three are upstream of the bill. They turn a 30-day-late invoice surprise into a same-deploy alert. That's the design Sutrace is built around — see [the OTel backend page](/use-cases/opentelemetry-backend) for the architectural detail and [the pricing page](/pricing) for how this lands in our SKU.
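As a rough illustration of points 2 and 3 together, a write-time check can project the effect of a proposed label against a service's budget. This is a sketch only: `project_label_change`, the 20,000-series budget, and the $0.05 rate are assumptions for the example, not a description of any product's implementation.

```python
PRICE_PER_SERIES_USD = 0.05  # assumption: the per-series rate used in this post


def project_label_change(current_series, new_label_values, budget,
                         price=PRICE_PER_SERIES_USD):
    """Project series count and monthly cost delta if a label is added."""
    projected = current_series * new_label_values
    return {
        "projected_series": projected,
        "over_budget": projected > budget,
        "monthly_cost_delta_usd": round((projected - current_series) * price, 2),
    }


# Example 3: adding customer_tier (4 values) to a 3,840-series metric.
tier = project_label_change(3_840, 4, budget=20_000)
print(tier["projected_series"], tier["over_budget"])  # 15360 False

# Example 4: adding customer_id (5,000 values) on top of that.
customer = project_label_change(15_360, 5_000, budget=20_000)
print(customer["projected_series"], customer["over_budget"])  # 76800000 True
```

Run as a CI step or a pre-merge comment, a check like this surfaces the cost at review time instead of on the invoice.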

## What to remember

- Cardinality is multiplication. Each label adds a *factor*, not a *summand*.
- High cardinality is fine for traces and logs. It's structurally bad for metrics.
- The labels that hurt are the ones whose value space is unbounded (`request_id`, `customer_id`, UUIDs).
- The cheapest defence is `sample_limit` on every Prometheus scrape config. The architectural defence is per-service budgets enforced at the collector.
- The conversation with finance is "we're going to add this label, here's what it'll cost." Not "the bill went up, why?"

## Closing

Cardinality is the single most expensive concept in observability that no one teaches in a course. If you take one thing from this post: *next time someone proposes adding a label, do the multiplication.* `existing_series × N_new_label_values = new_series`. If `N_new_label_values` is unbounded — UUIDs, customer_ids in a B2C product, full URLs — the answer is: not on a metric. On a trace.

The [Sutrace pricing page](/pricing) doesn't bill on cardinality. We track it, budget it, and warn when it spikes. That's a deliberate product choice, made because the rest of the industry's pricing model is a footgun. The [OTel backend use-case page](/use-cases/opentelemetry-backend) covers the implementation; the [Datadog comparison](/compare/sutrace-vs-datadog) covers the migration.

Multiplication, not addition. That's the entire post.
