OTel Collector to ClickHouse — a quickstart you can run in an hour

The architecture, the YAML, the ClickHouse schema, and the gotchas. A working OpenTelemetry Collector → ClickHouse pipeline you can deploy today.

By Akshay Sarode · August 4, 2025 · 12 min read
Tags: opentelemetry, clickhouse, collector, observability

TL;DR. ClickHouse is the storage layer that quietly won the OTel backend race — see ClickHouse's own roundup of OTel-compatible platforms and the year-in-review post. This post is the working pipeline: an OpenTelemetry Collector configured with the clickhouseexporter, a ClickHouse schema for traces/metrics/logs, the gotchas (TTL, materialized columns, cardinality control), and the operational notes you'll wish you had on day two. If you've decided to self-host the storage and want a recipe that actually compiles, this is it. If you'd rather we run it for you, the Sutrace OTel backend page covers the managed version.

The pipeline below is the same one we run internally at Sutrace, simplified for a single-node demo.

Why ClickHouse for observability

Three reasons. First, the columnar storage compresses observability data 10–20x in practice — traces especially, because most spans are repetitive. Second, ClickHouse's query engine handles billion-row aggregations in seconds, which is the workload pattern dashboards generate. Third, the official OTel Collector exporter is mature enough that schema and exporter co-evolve.

Last9 and Sysdig document the failure modes of the alternative — single-instance Prometheus past the 1–2M active-series ceiling, then sharding misery. ClickHouse is the answer to "what if we just used a real OLAP database from the start?"

Architecture in one diagram, words version

You have:

  • Apps instrumented with the upstream OTel SDKs, emitting OTLP over gRPC to a local sidecar collector.
  • Sidecar collectors (one per node or per pod) that batch, sample, and forward OTLP to a regional collector.
  • Regional collectors (typically one per region or AZ) that handle compression, retry, and the final hop into ClickHouse.
  • ClickHouse cluster with one table per signal type (traces, metrics, logs) plus materialized views for the queries you run most.
  • UI layer — Grafana, Superset, or a vendor like Sutrace — querying ClickHouse over its native protocol.

The two-tier collector pattern matters at scale. The sidecar handles the per-app cardinality control (drop user_id labels, sanitise PII). The regional collector handles compression, retry, and writes to ClickHouse with proper batching. Don't write directly from sidecars — you'll DDoS your own ClickHouse cluster on a deploy storm.

The collector config

# otel-collector-clickhouse.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 25

  batch:
    timeout: 10s
    send_batch_size: 10000
    send_batch_max_size: 11000

  resource:
    attributes:
      - key: deployment.environment
        value: ${env:DEPLOY_ENV}
        action: upsert
      - key: cloud.region
        value: ${env:CLOUD_REGION}
        action: upsert

  # Cardinality defence — drop common high-card labels at the edge
  transform/cardinality:
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(attributes, "user_id")
          - delete_key(attributes, "request_id")
          - delete_key(attributes, "trace_id")

  # PII redaction — pre-storage
  attributes/redact:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: http.request.header.cookie
        action: delete

  filter/healthchecks:
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'
        - 'attributes["http.target"] == "/readyz"'

exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000?dial_timeout=10s
    database: otel
    username: default
    password: ${env:CLICKHOUSE_PASSWORD}
    ttl: 720h           # 30-day default retention
    logs_table_name: otel_logs
    traces_table_name: otel_traces
    metrics_table_name: otel_metrics
    timeout: 10s
    sending_queue:
      enabled: true
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes/redact, filter/healthchecks, resource, batch]
      exporters: [clickhouse]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, transform/cardinality, resource, batch]
      exporters: [clickhouse]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes/redact, resource, batch]
      exporters: [clickhouse]

A few non-obvious choices in here. memory_limiter is first in every pipeline so a metric storm can't OOM the collector. filter/healthchecks drops /healthz spans because they otherwise dominate trace volume in any Kubernetes deployment. attributes/redact is before batch so PII never sits in a batch buffer in plaintext.

The ttl: 720h setting matters: the exporter uses it to generate the table-level TTL clause when it creates tables. You can also set per-column TTL for finer control.
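A sketch of that finer control. Column TTL drops only that column's data while table TTL drops whole rows; the 7-day window is an illustrative choice, not a recommendation:

```sql
-- Table-level: rows older than 30 days are deleted outright.
ALTER TABLE otel.otel_traces
    MODIFY TTL toDate(Timestamp) + INTERVAL 30 DAY DELETE;

-- Column-level: drop the bulky attribute map earlier than the rest of the row.
ALTER TABLE otel.otel_traces
    MODIFY COLUMN SpanAttributes Map(LowCardinality(String), String)
    TTL toDateTime(Timestamp) + INTERVAL 7 DAY;
```

Column TTL works on any column outside the sort key, which is exactly where the bulky attribute maps live.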

The ClickHouse schema

The exporter creates these tables on first write, but if you want them shaped for your query patterns, create them yourself first:

-- Traces
CREATE TABLE otel.otel_traces (
    Timestamp DateTime64(9) CODEC(Delta, ZSTD(1)),
    TraceId String CODEC(ZSTD(1)),
    SpanId String CODEC(ZSTD(1)),
    ParentSpanId String CODEC(ZSTD(1)),
    TraceState String CODEC(ZSTD(1)),
    SpanName LowCardinality(String) CODEC(ZSTD(1)),
    SpanKind LowCardinality(String) CODEC(ZSTD(1)),
    ServiceName LowCardinality(String) CODEC(ZSTD(1)),
    ResourceAttributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    SpanAttributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    Duration Int64 CODEC(ZSTD(1)),
    StatusCode LowCardinality(String) CODEC(ZSTD(1)),
    StatusMessage String CODEC(ZSTD(1)),
    Events Nested (
        Timestamp DateTime64(9),
        Name LowCardinality(String),
        Attributes Map(LowCardinality(String), String)
    ) CODEC(ZSTD(1)),
    Links Nested (
        TraceId String,
        SpanId String,
        TraceState String,
        Attributes Map(LowCardinality(String), String)
    ) CODEC(ZSTD(1)),
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
    INDEX idx_service ServiceName TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_duration Duration TYPE minmax GRANULARITY 1
) ENGINE = MergeTree()
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, SpanName, toUnixTimestamp(Timestamp), TraceId)
TTL toDate(Timestamp) + INTERVAL 30 DAY DELETE
SETTINGS index_granularity = 8192;

The LowCardinality(String) wrapper is the single most important optimisation. It tells ClickHouse to dictionary-encode strings with fewer than ~10k unique values — which is exactly what ServiceName, SpanKind, StatusCode, and resource attributes look like. Compression ratio improves dramatically. Without LowCardinality, traces tables balloon.
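To see what LowCardinality buys you on a live table, system.columns reports compressed versus uncompressed bytes per column (the database and table names here match the schema above):

```sql
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS on_disk,
    formatReadableSize(data_uncompressed_bytes) AS raw,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = 'otel' AND table = 'otel_traces'
ORDER BY data_compressed_bytes DESC;
```

Run it before and after wrapping a column to measure the difference on your own data rather than trusting the 10–20x rule of thumb.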

The idx_trace_id bloom filter index makes "find all spans for trace X" queries fast. Without it, you scan partitions sequentially and a high-volume service feels sluggish.
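The lookup that index serves looks like this — the trace ID is a placeholder, and the time filter matters too, because it lets ClickHouse prune partitions before the bloom filter ever runs:

```sql
SELECT Timestamp, SpanId, ParentSpanId, SpanName, Duration, StatusCode
FROM otel.otel_traces
WHERE TraceId = '0123456789abcdef0123456789abcdef'  -- placeholder trace ID
  AND Timestamp >= now() - INTERVAL 24 HOUR          -- partition pruning first
ORDER BY Timestamp;
```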

-- Metrics (sum example; gauge / histogram have similar shape)
CREATE TABLE otel.otel_metrics_sum (
    ResourceAttributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    ResourceSchemaUrl String CODEC(ZSTD(1)),
    ScopeName String CODEC(ZSTD(1)),
    ScopeVersion String CODEC(ZSTD(1)),
    ScopeAttributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    ScopeDroppedAttrCount UInt32,
    ScopeSchemaUrl String CODEC(ZSTD(1)),
    ServiceName LowCardinality(String) CODEC(ZSTD(1)),
    MetricName LowCardinality(String) CODEC(ZSTD(1)),
    MetricDescription String CODEC(ZSTD(1)),
    MetricUnit String CODEC(ZSTD(1)),
    Attributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    StartTimeUnix DateTime64(9) CODEC(Delta, ZSTD(1)),
    TimeUnix DateTime64(9) CODEC(Delta, ZSTD(1)),
    Value Float64 CODEC(ZSTD(1)),
    Flags UInt32,
    Exemplars Nested (
        FilteredAttributes Map(LowCardinality(String), String),
        TimeUnix DateTime64(9),
        Value Float64,
        SpanId String,
        TraceId String
    ) CODEC(ZSTD(1)),
    AggregationTemporality Int32,
    IsMonotonic Boolean
) ENGINE = MergeTree()
PARTITION BY toDate(TimeUnix)
ORDER BY (ServiceName, MetricName, Attributes, TimeUnix)
TTL toDate(TimeUnix) + INTERVAL 30 DAY DELETE
SETTINGS index_granularity = 8192;

-- Logs
CREATE TABLE otel.otel_logs (
    Timestamp DateTime64(9) CODEC(Delta, ZSTD(1)),
    TraceId String CODEC(ZSTD(1)),
    SpanId String CODEC(ZSTD(1)),
    TraceFlags UInt32,
    SeverityText LowCardinality(String) CODEC(ZSTD(1)),
    SeverityNumber Int32,
    ServiceName LowCardinality(String) CODEC(ZSTD(1)),
    Body String CODEC(ZSTD(1)),
    ResourceAttributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    LogAttributes Map(LowCardinality(String), String) CODEC(ZSTD(1)),
    INDEX idx_severity SeverityText TYPE bloom_filter(0.01) GRANULARITY 1,
    INDEX idx_body Body TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 1
) ENGINE = MergeTree()
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, toUnixTimestamp(Timestamp))
TTL toDate(Timestamp) + INTERVAL 30 DAY DELETE
SETTINGS index_granularity = 8192;

The tokenbf_v1 token bloom filter on Body is what makes full-text log search fast. Without it, you're doing string scans over hundreds of GB. With it, common-token queries are sub-second.
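A search the index can actually serve uses hasToken, which splits on the same non-alphanumeric boundaries the tokenbf_v1 index does; a LIKE with a leading wildcard bypasses the index entirely:

```sql
SELECT Timestamp, ServiceName, Body
FROM otel.otel_logs
WHERE hasToken(Body, 'timeout')             -- served by idx_body
  AND Timestamp >= now() - INTERVAL 1 HOUR
ORDER BY Timestamp DESC
LIMIT 100;
```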

Materialized views — the speedup most pipelines skip

Once raw data lands, build aggregation views for the dashboards you actually run. Example: hourly p99 latency per service:

CREATE MATERIALIZED VIEW otel.svc_latency_hourly
ENGINE = AggregatingMergeTree()
PARTITION BY toDate(hour)
ORDER BY (ServiceName, SpanName, hour)
AS SELECT
    ServiceName,
    SpanName,
    toStartOfHour(Timestamp) AS hour,
    quantileState(0.99)(Duration) AS p99_state,
    quantileState(0.50)(Duration) AS p50_state,
    countState() AS request_count_state
FROM otel.otel_traces
GROUP BY ServiceName, SpanName, hour;

The -State columns hold partial aggregates, not final values, so reads go through the matching -Merge combinators (quantileMerge, countMerge). Note that Duration in this schema is nanoseconds, so divide by 1e6 for milliseconds. Either way, the lookup returns in single-digit milliseconds for any service and time range. It's the same pattern Cloudflare uses with their Prometheus recording rules — see the Cloudflare Prometheus-at-scale post for the rationale.
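The read side, as a sketch against the view defined above:

```sql
SELECT
    ServiceName,
    hour,
    quantileMerge(0.99)(p99_state) / 1e6  AS p99_ms,  -- Duration is in ns
    quantileMerge(0.50)(p50_state) / 1e6  AS p50_ms,
    countMerge(request_count_state)       AS requests
FROM otel.svc_latency_hourly
WHERE hour >= now() - INTERVAL 24 HOUR
GROUP BY ServiceName, hour
ORDER BY ServiceName, hour;
```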

The gotchas

1. TTL drift between collector and table. The exporter's ttl setting only applies when the exporter creates the tables. If a table already exists with its own TTL, the table's setting wins, and changing the exporter value later does nothing. Always set TTL on the table (ALTER TABLE … MODIFY TTL), not just on the exporter.

2. Map columns are slower than Array(Tuple) for cardinality scans. If your dominant query pattern is "count series by label X", consider materialised columns extracted from the map. The exporter docs cover the pattern.
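A hedged sketch of that pattern, promoting one hot key out of the span attribute map — the key name here is an assumption; use whatever your instrumentation actually emits:

```sql
ALTER TABLE otel.otel_traces
    ADD COLUMN HttpMethod LowCardinality(String)
    MATERIALIZED SpanAttributes['http.request.method']
    CODEC(ZSTD(1));

-- Backfill existing parts; new inserts populate the column automatically.
ALTER TABLE otel.otel_traces MATERIALIZE COLUMN HttpMethod;
```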

3. Cardinality still bites in ClickHouse. ClickHouse handles high-cardinality storage gracefully but query latency degrades when ORDER BY columns explode. Keep customer_id and request_id out of the table sort key. Demote them to attribute maps. We covered the math in cardinality explained with examples.

4. ZooKeeper / Keeper for replicated tables. A single-node setup uses MergeTree. A replicated cluster uses ReplicatedMergeTree, which needs ZooKeeper or ClickHouse Keeper. The HN comment from srcreigh about SigNoz's mandatory ZK + 1 ClickHouse host setup is real — for production, you do want this, but for a demo, single-node is fine.
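If you do go replicated, only the engine line of each DDL changes. This is a fragment, not a full statement, and the {shard}/{replica} macros are assumptions that must match the macros section of your server config:

```sql
-- Replace 'ENGINE = MergeTree()' in the DDLs above with:
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/otel_traces', '{replica}')
```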

5. Partition pruning. Always include Timestamp in your WHERE. A query without a time filter scans everything.
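Concretely — the service name is a placeholder, and EXPLAIN shows how many parts survive pruning:

```sql
-- The time filter prunes to a handful of daily partitions:
SELECT count() FROM otel.otel_traces
WHERE ServiceName = 'checkout'                 -- placeholder service
  AND Timestamp >= now() - INTERVAL 6 HOUR;

-- Verify the pruning actually happened:
EXPLAIN indexes = 1
SELECT count() FROM otel.otel_traces
WHERE Timestamp >= now() - INTERVAL 6 HOUR;
```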

Operational notes for week two

Once the pipeline is running:

  • Watch the collector's exporter queue depth. If the queue is filling up, you're write-bound on ClickHouse. Add an inserter node, or batch larger.
  • Monitor system.parts count per table. More than ~5,000 active parts and your merges fall behind. Tune parts_to_throw_insert and consider larger batch sizes upstream.
  • Set up alerts on collector OTLP receiver errors. A bad sidecar config will silently drop spans; you want to know within minutes.
  • Run EXPLAIN PIPELINE on slow dashboard queries. Materialise the ones that scan more than 100M rows.
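The parts check in the second bullet is a single query against system.parts:

```sql
SELECT table, count() AS active_parts
FROM system.parts
WHERE active AND database = 'otel'
GROUP BY table
ORDER BY active_parts DESC;
```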

When to use this versus a managed backend

Self-hosting OTel-Collector + ClickHouse makes sense when:

  • You have ClickHouse expertise on the team already.
  • You have a strict cost ceiling and you're at scale where managed pricing dominates ops cost.
  • You have data residency requirements that no managed vendor satisfies for you.

Managed makes sense when:

  • You don't have a dedicated platform team.
  • You want cardinality cost attribution as a product feature, not a query you write.
  • You want EU residency without running the cluster yourself.

If managed is the right call, the Sutrace use-case page is the next read. If self-hosting is, the SigNoz comparison covers the OSS managed-or-self-host axis honestly.

Closing

The OTel-Collector → ClickHouse pipeline is mature enough in 2025 that an engineer can stand it up in an afternoon and have it run in production for a year. The bookkeeping — schema, TTL, materialised views, cardinality control — is what separates a working demo from something you trust on-call. Get those right and the rest is queries.

If you want the pipeline without the bookkeeping, that's what we run. If you want to roll your own and just want a sane starting config, you have it above. Either way, OTLP → ClickHouse is the architecture worth standardising on.