Migrating from Datadog to OTel — the week-one checklist
A concrete day-by-day plan for moving off the Datadog Agent to a vendor-neutral OpenTelemetry collector. With config, traps, and what to skip.
TL;DR. You don't migrate off Datadog by ripping out the agent. You migrate by standing up an OpenTelemetry collector that dual-writes to Datadog and to your new backend, then cutting over service-by-service while both backends stay live. Week one is collector + dual-write + first three services. Week two is dashboards + alerts + the rest of the services. Week three is logs + synthetics. Week four is decommission. This post is the day-by-day version, with collector config, the migration traps we've watched teams hit, and the parts you can safely skip.
The market direction is clear. The Grafana 2025 OpenTelemetry report shows OTel adoption past inflection — most new instrumentation in 2025 was OTel. The Dynatrace 2025 OpenTelemetry trends post and ClickHouse's compatible-platforms list both confirm: vendor-neutral instrumentation is the default. The question isn't whether to migrate, it's whether to do it deliberately or under renewal pressure.
Day 0 — Inventory before you touch anything
Before any code changes, do this:
- Export your Datadog metric list. The Datadog API or `datadog-cli metric list` works. Dump it to CSV.
- Tag every metric with three columns: owner team, label-cardinality estimate, last-queried date. The last-queried date is the most useful column you'll generate this week: typically 25–40% of metrics haven't been queried in 90 days.
- List your dashboards. Sort by view count. The top 20 are probably 80% of the value.
- List your alerts. Sort by trigger count over the last 30 days. The chatty ones are the most expensive to migrate (you'll want to fix them, not port them).
- List your synthetics. Browser checks, API checks, multi-step. Note the regional fan-out per check — this is where the cost is hiding (Checkly's $12-at-a-time math is the reference).
This inventory is the migration plan. Without it you'll port noise.
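Once the CSV exists, the triage step is mechanical. A minimal sketch of the 90-day filter, assuming an inventory row shape of `metric` / `owner` / `last_queried` (ISO date, empty if never queried) that you'd adapt to whatever columns your export produces:

```python
from datetime import datetime, timedelta

def stale_metrics(rows, today, cutoff_days=90):
    """Partition inventory rows into (keep, drop) by last-queried date.

    Anything not queried within `cutoff_days` of `today` is a drop
    candidate: don't port it to the new backend.
    """
    cutoff = today - timedelta(days=cutoff_days)
    keep, drop = [], []
    for row in rows:
        last = row.get("last_queried", "")
        if last and datetime.fromisoformat(last).date() >= cutoff:
            keep.append(row)
        else:
            drop.append(row)
    return keep, drop

# Illustrative rows; real ones come from your Datadog export.
rows = [
    {"metric": "http.requests", "owner": "platform", "last_queried": "2025-06-01"},
    {"metric": "debug.tmp_counter", "owner": "?", "last_queried": "2023-11-02"},
    {"metric": "never.queried", "owner": "?", "last_queried": ""},
]
keep, drop = stale_metrics(rows, today=datetime(2025, 6, 10).date())
```

Run it before Day 3 and Day 4: the `drop` list is also your shortlist of dashboards and alerts that reference dead metrics.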
Day 1 — Stand up the OTel collector
Run the collector as a sidecar, DaemonSet, or standalone process — whatever fits your platform. The minimum config dual-writes to Datadog and to your new backend (Sutrace's OTLP endpoint in the example below; substitute your own):
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'app'
          scrape_interval: 30s
          sample_limit: 200  # Cloudflare-style guardrail
          static_configs:
            - targets: ['app:9090']

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 25
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  datadog:
    api:
      key: ${DD_API_KEY}
      site: datadoghq.eu
  otlp/sutrace:
    endpoint: ingest.sutrace.io:443
    headers:
      authorization: Bearer ${SUTRACE_TOKEN}
    tls:
      insecure: false

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch, resource]
      exporters: [datadog, otlp/sutrace]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [datadog, otlp/sutrace]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [datadog, otlp/sutrace]
```
The sample_limit: 200 line is the Cloudflare pattern. It will save you from a runaway scrape during the migration window.
Verify the config with `otelcol validate --config=otel-config.yaml`, then point one service's traffic at the collector. Watch both Datadog and your new backend receive data.
Day 2 — Convert one service to OTel SDK
Pick the lowest-stakes service. Don't pick the payment service first.
Replace dd-trace with the OTel SDK. For a Python service:
```python
# Before
from ddtrace import tracer
from ddtrace.runtime import RuntimeMetrics

RuntimeMetrics.enable()
```

```python
# After
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```
For Node, Go, Java, .NET — the OTel docs cover these. The auto-instrumentation packages handle 80% of the work. The 20% is custom spans, which transfer one-to-one.
Validate by sending a request through the service and seeing the trace in both Datadog and the new backend. Both should show the same span graph.
Day 3 — Dashboards (the top 20, not all of them)
Open your Datadog dashboard list, sorted by view count. Recreate the top 20 in your new backend. The other 200 are abandoned — leave them in Datadog for the cutover window and delete after.
DDQL → PromQL is mostly mechanical:
| DDQL | PromQL |
|---|---|
| `avg:http.requests{*}` | `avg(http_requests_total)` |
| `sum:http.requests{*} by {service}.as_rate()` | `sum by (service) (rate(http_requests_total[5m]))` |
| `p95:http.duration{*}` | `histogram_quantile(0.95, sum(rate(http_duration_bucket[5m])) by (le))` |
| `anomalies(...)` | Use built-in anomaly rules in your backend |
The rate conversion is the trap most teams hit: without `.as_rate()`, Datadog counts are per rollup interval, while Prometheus's `rate()` is always per-second. If the dashboard numbers are off by 60×, you forgot the time-base conversion.
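The arithmetic behind that 60× figure, as a trivial sketch:

```python
def per_interval_to_per_second(count, interval_seconds):
    """Convert a per-rollup-interval count to a per-second rate."""
    return count / interval_seconds

# 600 requests in a 60 s rollup interval:
datadog_style = 600  # count per minute
prometheus_style = per_interval_to_per_second(600, 60)  # 10 req/s
discrepancy = datadog_style / prometheus_style  # 60x if you forget to convert
```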
Keep both dashboards open side-by-side for a week. If they don't match, the new one is wrong; trust the agent until you've reconciled.
Day 4 — Alerts
This is where you get to delete things.
Pull your alert list sorted by trigger count over 30 days. Anything that fired more than 100 times is noise, not signal. Don't port noise. Rewrite or delete.
For the alerts you do port, the pattern is:
- Export the alert definition from Datadog.
- Map the query to PromQL or your backend's alert language.
- Set the destination to the same channel (PagerDuty, Slack, etc.) but with a `[shadow]` prefix in the title for the first week.
- Run both alerts in parallel. The shadow alert should fire on the same conditions; if not, fix it.
- After a week of clean shadow runs, drop the `[shadow]` prefix and disable the Datadog alert.
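The shadow step is easy to script against an exported alert list. A minimal sketch, assuming a simplified alert shape of `title` / `query` / `channel` (your real export will have more fields):

```python
def to_shadow(alert):
    """Return a copy of an alert definition with a [shadow] title prefix.

    The shadow copy fires to the same channel, so for a week you can
    compare it line-by-line against the original Datadog alert.
    Idempotent: re-running it won't stack prefixes.
    """
    shadow = dict(alert)
    if not shadow["title"].startswith("[shadow] "):
        shadow["title"] = "[shadow] " + shadow["title"]
    return shadow

a = {"title": "High error rate", "query": "...", "channel": "#oncall"}
s = to_shadow(a)
```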
The hardest alerts to migrate are anomaly-detection alerts using Datadog's Watchdog. Be honest about whether you actually use the anomaly signal or whether you ack it and move on. If it's the latter, replace with a static threshold.
Day 5 — First production cutover
Pick one service from Day 2's pilot. In its OTel collector config, remove the Datadog exporter. The service now writes only to the new backend.
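Concretely, the cutover is a one-line change per pipeline in that service's collector config. A sketch of the traces pipeline after Day 5, mirroring the dual-write config from Day 1:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlp/sutrace]  # datadog exporter removed: single-write from here on
```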
Watch dashboards, alerts, and on-call channels for 24 hours. If nothing breaks, you've validated the path.
Week 2 — Logs and traces at scale
Logs. If you're shipping with the Datadog Agent's log collector, replace with Vector or the OTel logs receiver. Vector is the fastest path because it has Datadog source/sink parity and supports OTel as a sink:
```toml
[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]

[sinks.otlp]
type = "opentelemetry"
inputs = ["app_logs"]
endpoint = "http://otel-collector:4318"

[sinks.dd]
type = "datadog_logs"
inputs = ["app_logs"]
default_api_key = "${DD_API_KEY}"
```
Dual-write for a week. Check log counts in both backends match (within 1%). If they don't, you have a parsing or batching bug — fix it before cutover.
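The 1% check is worth automating so it runs daily during the dual-write week. A minimal sketch, assuming you can pull a day's log count from each backend by whatever query API it exposes:

```python
def counts_match(datadog_count, new_backend_count, tolerance=0.01):
    """True if the two backends' log counts agree within `tolerance` (default 1%)."""
    if datadog_count == 0:
        return new_backend_count == 0
    return abs(datadog_count - new_backend_count) / datadog_count <= tolerance

ok = counts_match(1_000_000, 995_500)       # 0.45% gap: safe to cut over
bad = counts_match(1_000_000, 940_000)      # 6% gap: parsing or batching bug
```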
Traces. If you're already on OTel SDKs from Day 2, traces are already migrating. The remaining work is the long tail of services. Schedule one service per day for the rest of week two.
Week 3 — Synthetics and uptime
Synthetics is where cost savings show up the fastest. The Checkly post shows $8,509/month for 16 routes × 4 regions × every 4 minutes. We worked through the math in the Datadog synthetics calculator post.
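The run-count math behind that figure is worth doing for your own check list. A sketch, where the $12-per-1,000-runs rate is the assumed Checkly reference figure, not your contract price:

```python
def monthly_runs(routes, regions, every_minutes, days=30):
    """Synthetic check runs per month: every route runs in every region."""
    runs_per_route_region = days * 24 * 60 // every_minutes
    return routes * regions * runs_per_route_region

runs = monthly_runs(routes=16, regions=4, every_minutes=4)
# Assumed illustrative rate of $12 per 1,000 runs; lands in the same
# ballpark as the $8,509/month figure above.
approx_cost = runs * 12 / 1000
```

Regional fan-out is the multiplier to attack first: dropping from 4 regions to 2 halves the bill before you change anything else.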
Move synthetics in one go. They're not load-bearing on the migration — duplicate them in the new tool, run for a week, then disable in Datadog.
Week 4 — Decommission
- Drop the Datadog exporter from all collector configs.
- Remove the Datadog Agent from base images and Helm charts.
- Disable the Datadog dashboards and alerts (don't delete yet — keep for 30 days as backup).
- Notify procurement of the renewal change before the next billing cycle.
- After 30 days, delete the dashboards and downgrade the contract.
The order matters. Do not cancel before week 4. The cost of running both for a month is trivial compared to the cost of botching the cutover.
What to skip
You do not need to migrate:
- Dashboards with zero views in 90 days. Delete them.
- Alerts that fired more than once a week and were always acked. Rewrite or delete.
- Custom metric tags that exist for one engineer's debugging session from 2023. Audit at migration time.
- The Datadog Mobile App. The new backend has its own.
- Watchdog alerts on metrics nobody owns. Find the owner or delete.
Common traps
The label rename trap. Datadog auto-tags some attributes (env, service, version). OTel uses deployment.environment, service.name, service.version. Add a resource processor that maps these consistently or your dashboards will be empty.
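The real fix lives in the collector's resource processor, but the mapping itself is small enough to pin down explicitly. A sketch of the three reserved-tag renames:

```python
DD_TO_OTEL = {
    "env": "deployment.environment",
    "service": "service.name",
    "version": "service.version",
}

def rename_labels(labels):
    """Rename Datadog reserved tags to their OTel resource-attribute names.

    Unknown labels pass through unchanged.
    """
    return {DD_TO_OTEL.get(k, k): v for k, v in labels.items()}

out = rename_labels({"env": "production", "service": "checkout", "region": "eu-west-1"})
```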
The histogram bucket trap. Datadog stores histograms; Prometheus uses bucketed counters. p95 queries look different. The PromQL pattern in the table above is the canonical form.
The cardinality trap. Migration is the moment teams discover that their customer_id label was costing $4,000/month. The fix is not "remove it from the new system" — it's "move it from metrics to traces/logs," where high cardinality is fine. The cardinality post covers this.
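The back-of-envelope for that $4,000 number: series count is the product of label cardinalities. A sketch, where the per-series price is an ASSUMED illustrative rate (check your own contract):

```python
def custom_metric_cost(label_cardinalities, price_per_series=0.05):
    """Estimate monthly cost of one custom metric from its label cardinalities.

    price_per_series is an assumed illustrative USD/series/month rate.
    """
    series = 1
    for c in label_cardinalities:
        series *= c
    return series, series * price_per_series

# e.g. customer_id (20,000 values) x endpoint (4 values)
series, cost = custom_metric_cost([20_000, 4])
```

The same 80,000 series moved onto trace or log attributes costs nothing extra in most OTel-native backends, which is why the fix is relocation, not deletion.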
The "I'll skip dual-write" trap. Don't. The week of dual-write costs you ~$300 in extra ingest and saves you a Sev-1 during cutover.
What this looks like in Sutrace
We're an OTel-native backend. The collector config above (with the otlp/sutrace exporter) is literally the config we ship to new customers. We provide a DDQL→PromQL converter for the top 50 query patterns and a dashboard import tool that takes a Datadog JSON export.
The migration usually takes 3–4 weeks of part-time effort from one engineer. We help with the trace mapping and the dashboard recreation. The Datadog alternative page has the side-by-side. Pricing is on the pricing page.
Closing
Migration is not technical risk. It's organizational risk — the question is whether your team has 4 weeks of bandwidth to do it deliberately instead of waiting for a renewal forcing function. Dual-write is the trick. Decommission last. Trust the inventory more than the dashboards. And don't port noise.
If you want a starter collector config tuned for your specific stack, the use case page for OTel backends has language-specific snippets. The trial accepts your existing OTel collector with one config change.