OTel Collector to ClickHouse — a quickstart you can run in an hour
The architecture, the YAML, the ClickHouse schema, and the gotchas. A working OpenTelemetry Collector → ClickHouse pipeline you can deploy today.
TL;DR. ClickHouse is the storage layer that quietly won the OTel backend race — see ClickHouse's own roundup of OTel-compatible platforms and the year-in-review post. This post is the working pipeline: an OpenTelemetry Collector configured with the clickhouseexporter, a ClickHouse schema for traces/metrics/logs, the gotchas (TTL, materialized columns, cardinality control), and the operational notes you'll wish you had on day two. If you've decided to self-host the storage and want a recipe that actually compiles, this is it. If you'd rather we run it for you, the Sutrace OTel backend page covers the managed version.
The pipeline below is the same one we run internally at Sutrace, simplified for a single-node demo.
Why ClickHouse for observability
Three reasons. First, the columnar storage compresses observability data 10–20x in practice — traces especially, because most spans are repetitive. Second, ClickHouse's query engine handles billion-row aggregations in seconds, which is the workload pattern dashboards generate. Third, the official OTel Collector exporter is mature enough that schema and exporter co-evolve.
Last9 and Sysdig document the failure modes of the alternative — single-instance Prometheus past the 1–2M active-series ceiling, then sharding misery. ClickHouse is the answer to "what if we just used a real OLAP database from the start?"
Architecture in one diagram, words version
You have:
- Apps instrumented with the upstream OTel SDKs, emitting OTLP over gRPC to a local sidecar collector.
- Sidecar collectors (one per node or per pod) that batch, sample, and forward OTLP to a regional collector.
- Regional collectors (typically one per region or AZ) that handle compression, retry, and the final hop into ClickHouse.
- ClickHouse cluster with one table per signal type (traces, metrics, logs) plus materialized views for the queries you run most.
- UI layer — Grafana, Superset, or a vendor like Sutrace — querying ClickHouse over its native protocol.
The two-tier collector pattern matters at scale. The sidecar handles the per-app cardinality control (drop user_id labels, sanitise PII). The regional collector handles compression, retry, and writes to ClickHouse with proper batching. Don't write directly from sidecars — you'll DDoS your own ClickHouse cluster on a deploy storm.
The collector config
# otel-collector-clickhouse.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 25
  batch:
    timeout: 10s
    send_batch_size: 10000
    send_batch_max_size: 11000
  resource:
    attributes:
      - key: deployment.environment
        value: ${env:DEPLOY_ENV}
        action: upsert
      - key: cloud.region
        value: ${env:CLOUD_REGION}
        action: upsert
  # Cardinality defence — drop common high-card labels at the edge
  transform/cardinality:
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(attributes, "user_id")
          - delete_key(attributes, "request_id")
          - delete_key(attributes, "trace_id")
  # PII redaction — pre-storage
  attributes/redact:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: http.request.header.cookie
        action: delete
  filter/healthchecks:
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'
        - 'attributes["http.target"] == "/readyz"'

exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000?dial_timeout=10s
    database: otel
    username: default
    password: ${env:CLICKHOUSE_PASSWORD}
    ttl: 720h # 30-day default retention
    logs_table_name: otel_logs
    traces_table_name: otel_traces
    metrics_table_name: otel_metrics
    timeout: 10s
    sending_queue:
      enabled: true
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes/redact, filter/healthchecks, resource, batch]
      exporters: [clickhouse]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, transform/cardinality, resource, batch]
      exporters: [clickhouse]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes/redact, resource, batch]
      exporters: [clickhouse]
A few non-obvious choices in here. memory_limiter is first in every pipeline so a metric storm can't OOM the collector. filter/healthchecks drops /healthz spans because they otherwise dominate trace volume in any Kubernetes deployment. attributes/redact is before batch so PII never sits in a batch buffer in plaintext.
The ttl: 720h setting is important. ClickHouse will use this to drive table-level TTL, but you should also set per-column TTL (below) for finer control.
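Per-column TTL lets a bulky column expire earlier than the row it belongs to. A sketch, assuming the traces schema below; the 7-day window is illustrative:

```sql
-- Hypothetical per-column TTL: clear StatusMessage after 7 days
-- while the rest of the row lives out the 30-day table TTL.
ALTER TABLE otel.otel_traces
    MODIFY COLUMN StatusMessage String CODEC(ZSTD(1))
    TTL toDateTime(Timestamp) + INTERVAL 7 DAY;
```

Expired column values are reset to the column default during background merges, so the space comes back lazily rather than instantly.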
The ClickHouse schema
The exporter creates these tables on first write, but if you want them shaped for your query patterns, create them yourself first:
-- Traces
CREATE TABLE otel.otel_traces (
Timestamp DateTime64(9) CODEC(Delta, ZSTD(1)),
TraceId String CODEC(ZSTD(1)),
SpanId String CODEC(ZSTD(1)),
ParentSpanId String CODEC(ZSTD(1)),
TraceState String CODEC(ZSTD(1)),
SpanName LowCardinality(String) CODEC(ZSTD(1)),
SpanKind LowCardinality(String) CODEC(ZSTD(1)),
ServiceName LowCardinality(String) CODEC(ZSTD(1)),
ResourceAttributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
SpanAttributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
Duration Int64 CODEC(ZSTD(1)),
StatusCode LowCardinality(String) CODEC(ZSTD(1)),
StatusMessage String CODEC(ZSTD(1)),
Events Nested (
Timestamp DateTime64(9),
Name LowCardinality(String),
Attributes Map(LowCardinality(String), String)
) CODEC(ZSTD(1)),
Links Nested (
TraceId String,
SpanId String,
TraceState String,
Attributes Map(LowCardinality(String), String)
) CODEC(ZSTD(1)),
INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
INDEX idx_service ServiceName TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_duration Duration TYPE minmax GRANULARITY 1
) ENGINE = MergeTree()
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, SpanName, toUnixTimestamp(Timestamp), TraceId)
TTL toDate(Timestamp) + INTERVAL 30 DAY DELETE
SETTINGS index_granularity = 8192;
The LowCardinality(String) wrapper is the single most important optimisation. It tells ClickHouse to dictionary-encode strings with fewer than ~10k unique values — which is exactly what ServiceName, SpanKind, StatusCode, and resource attributes look like. Compression ratio improves dramatically. Without LowCardinality, traces tables balloon.
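You can check the effect yourself from system.columns, which reports on-disk versus uncompressed bytes per column:

```sql
-- Where are LowCardinality and ZSTD earning their keep?
SELECT name,
       formatReadableSize(data_compressed_bytes)   AS compressed,
       formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE database = 'otel' AND table = 'otel_traces'
ORDER BY data_compressed_bytes DESC;
```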
The idx_trace_id bloom filter index makes "find all spans for trace X" queries fast. Without it, you scan partitions sequentially and a high-volume service feels sluggish.
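The canonical lookup that index accelerates, with the trace ID as a placeholder:

```sql
SELECT Timestamp, SpanId, ParentSpanId, SpanName, ServiceName, Duration, StatusCode
FROM otel.otel_traces
WHERE TraceId = 'REPLACE_WITH_TRACE_ID'      -- bloom filter skips non-matching granules
  AND Timestamp >= now() - INTERVAL 1 DAY    -- keeps partition pruning in play
ORDER BY Timestamp;
```

The time predicate matters even with the index: it lets ClickHouse discard whole daily partitions before the bloom filter is consulted.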
-- Metrics (sum example; gauge / histogram have similar shape)
CREATE TABLE otel.otel_metrics_sum (
ResourceAttributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
ResourceSchemaUrl String CODEC(ZSTD(1)),
ScopeName String CODEC(ZSTD(1)),
ScopeVersion String CODEC(ZSTD(1)),
ScopeAttributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
ScopeDroppedAttrCount UInt32,
ScopeSchemaUrl String CODEC(ZSTD(1)),
ServiceName LowCardinality(String) CODEC(ZSTD(1)),
MetricName LowCardinality(String) CODEC(ZSTD(1)),
MetricDescription String CODEC(ZSTD(1)),
MetricUnit String CODEC(ZSTD(1)),
Attributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
StartTimeUnix DateTime64(9) CODEC(Delta, ZSTD(1)),
TimeUnix DateTime64(9) CODEC(Delta, ZSTD(1)),
Value Float64 CODEC(ZSTD(1)),
Flags UInt32,
Exemplars Nested (
FilteredAttributes Map(LowCardinality(String), String),
TimeUnix DateTime64(9),
Value Float64,
SpanId String,
TraceId String
) CODEC(ZSTD(1)),
AggregationTemporality Int32,
IsMonotonic Boolean
) ENGINE = MergeTree()
PARTITION BY toDate(TimeUnix)
ORDER BY (ServiceName, MetricName, Attributes, TimeUnix)
TTL toDate(TimeUnix) + INTERVAL 30 DAY DELETE
SETTINGS index_granularity = 8192;
-- Logs
CREATE TABLE otel.otel_logs (
Timestamp DateTime64(9) CODEC(Delta, ZSTD(1)),
TraceId String CODEC(ZSTD(1)),
SpanId String CODEC(ZSTD(1)),
TraceFlags UInt32,
SeverityText LowCardinality(String) CODEC(ZSTD(1)),
SeverityNumber Int32,
ServiceName LowCardinality(String) CODEC(ZSTD(1)),
Body String CODEC(ZSTD(1)),
ResourceAttributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
LogAttributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
INDEX idx_severity SeverityText TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_body Body TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 1
) ENGINE = MergeTree()
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, toUnixTimestamp(Timestamp))
TTL toDate(Timestamp) + INTERVAL 30 DAY DELETE
SETTINGS index_granularity = 8192;
The tokenbf_v1 token bloom filter on Body is what makes full-text log search fast. Without it, you're doing string scans over hundreds of GB. With it, common-token queries are sub-second.
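hasToken is the predicate the tokenbf_v1 index can actually serve; a bare LIKE '%...%' only benefits when the pattern happens to align with token boundaries. A sketch:

```sql
-- Fast when 'timeout' hits the token bloom filter; the time
-- predicate limits the scan to recent partitions.
SELECT Timestamp, ServiceName, SeverityText, Body
FROM otel.otel_logs
WHERE hasToken(Body, 'timeout')
  AND Timestamp >= now() - INTERVAL 1 HOUR
ORDER BY Timestamp DESC
LIMIT 100;
```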
Materialized views — the speedup most pipelines skip
Once raw data lands, build aggregation views for the dashboards you actually run. Example: hourly p99 latency per service:
CREATE MATERIALIZED VIEW otel.svc_latency_hourly
ENGINE = AggregatingMergeTree()
PARTITION BY toDate(hour)
ORDER BY (ServiceName, SpanName, hour)
AS SELECT
ServiceName,
SpanName,
toStartOfHour(Timestamp) AS hour,
quantileState(0.99)(Duration) AS p99_state,
quantileState(0.50)(Duration) AS p50_state,
countState() AS request_count_state
FROM otel.otel_traces
GROUP BY ServiceName, SpanName, hour;
Because the view stores aggregation states, you read it with the -Merge combinators (quantileMerge, countMerge). That query returns p99 latency for any service, over any time range, in single-digit milliseconds. It's the same pattern Cloudflare uses with their Prometheus recording rules — see the Cloudflare Prometheus-at-scale post for the rationale.
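A sketch of the read side, assuming Duration is in nanoseconds (the exporter's convention):

```sql
SELECT
    ServiceName,
    hour,
    quantileMerge(0.99)(p99_state) / 1e6 AS p99_ms,  -- ns -> ms
    countMerge(request_count_state)      AS requests
FROM otel.svc_latency_hourly
WHERE hour >= now() - INTERVAL 24 HOUR
GROUP BY ServiceName, hour
ORDER BY ServiceName, hour;
```

Note that the quantile level passed to quantileMerge must match the one used when the state was built.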
The gotchas
1. TTL drift between collector and table. The exporter's ttl setting only takes effect when the exporter creates the tables itself. If you pre-create them, as above, the TTL clause on the table is what ClickHouse enforces, whatever the exporter config says. Always set TTL on the table, not just on the exporter.
2. Map columns are slower than Array(Tuple) for cardinality scans. If your dominant query pattern is "count series by label X", consider materialised columns extracted from the map. The exporter docs cover the pattern.
3. Cardinality still bites in ClickHouse. ClickHouse handles high-cardinality storage gracefully but query latency degrades when ORDER BY columns explode. Keep customer_id and request_id out of the table sort key. Demote them to attribute maps. We covered the math in cardinality explained with examples.
4. ZooKeeper / Keeper for replicated tables. A single-node setup uses MergeTree. A replicated cluster uses ReplicatedMergeTree, which needs ZooKeeper or ClickHouse Keeper. The HN comment from srcreigh about SigNoz's mandatory ZK + 1 ClickHouse host setup is real — for production, you do want this, but for a demo, single-node is fine.
5. Partition pruning. Always include Timestamp in your WHERE. A query without a time filter scans everything.
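For gotcha 2, the materialised-column pattern looks like this; the attribute key and column name are illustrative:

```sql
-- Promote a hot map key to a real column: dictionary-encoded,
-- cheap to scan, usable in WHERE without a map lookup per row.
ALTER TABLE otel.otel_traces
    ADD COLUMN HttpStatusCode LowCardinality(String)
    MATERIALIZED SpanAttributes['http.status_code'];
```

New inserts populate the column automatically; for existing parts ClickHouse computes it on read until you run ALTER TABLE ... MATERIALIZE COLUMN to backfill.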
Operational notes for week two
Once the pipeline is running:
- Watch the collector's exporter queue depth. If the queue is filling up, you're write-bound on ClickHouse. Add an inserter node, or batch larger.
- Monitor system.parts count per table. More than ~5,000 active parts and your merges fall behind. Tune parts_to_throw_insert and consider larger batch sizes upstream.
- Set up alerts on collector OTLP receiver errors. A bad sidecar config will silently drop spans; you want to know within minutes.
- Run EXPLAIN PIPELINE on slow dashboard queries. Materialise the ones that scan more than 100M rows.
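The parts check above, as a query you can wire straight into an alert:

```sql
-- Active part count per table; sustained growth past a few
-- thousand means merges are not keeping up with inserts.
SELECT database, table, count() AS active_parts
FROM system.parts
WHERE active AND database = 'otel'
GROUP BY database, table
ORDER BY active_parts DESC;
```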
When to use this versus a managed backend
Self-hosting OTel-Collector + ClickHouse makes sense when:
- You have ClickHouse expertise on the team already.
- You have a strict cost ceiling and you're at scale where managed pricing dominates ops cost.
- You have data residency requirements that no managed vendor satisfies for you.
Managed makes sense when:
- You don't have a dedicated platform team.
- You want cardinality cost attribution as a product feature, not a query you write.
- You want EU residency without running the cluster yourself.
If managed is the right call, the Sutrace use-case page is the next read. If self-hosting is, the SigNoz comparison covers the OSS managed-or-self-host axis honestly.
Closing
The OTel-Collector → ClickHouse pipeline is mature enough in 2026 that an engineer can stand it up in an afternoon and have it run in production for a year. The bookkeeping — schema, TTL, materialised views, cardinality control — is what separates a working demo from something you trust on-call. Get those right and the rest is queries.
If you want the pipeline without the bookkeeping, that's what we run. If you want to roll your own and just want a sane starting config, you have it above. Either way, OTLP → ClickHouse is the architecture worth standardising on.