Helicone vs LangSmith vs Langfuse vs Phoenix — what each one actually gets wrong

A 4-way honest comparison of the leading LLM observability tools, the gateway-plus-eval hybrid pattern that emerged, and where Sutrace fits.

By Akshay Sarode · February 18, 2026 · 14 min read · llm · ai-agents · observability · langsmith

TL;DR. There are four serious LLM-observability tools shipping in 2026: LangSmith (best for LangChain teams, $39/seat Plus + per-trace overage), Helicone (fastest setup, $20/seat with $200/mo cap, weakest evals), Langfuse (Apache-2.0, the genuinely good self-host story, $50/mo Cloud Pro starter, MCP gap), and Arize Phoenix (OSS, OpenInference-native, sparse production tooling). All four have real gaps. The pattern teams keep landing on: a gateway (Helicone-style) for caching and routing, plus an eval-first tool (Langfuse or LangSmith) for the depth — two bills, two setups. This post is the honest tour: real pricing math, framework lock-in, self-hosting, eval depth, side-by-side, and where the consolidation goes from here. The triggering read is Soufian Azzaoui's DEV writeup of trying all four — the most honest field report I've seen. This is mine, with prices.

The category, briefly

LLM observability is a two-year-old category that's already crowded. The shape: a tool that records traces of agent runs, attributes cost, scores outputs, and helps you debug regressions. The four leaders by GitHub stars, customer adoption, and community depth: LangSmith, Helicone, Langfuse, Phoenix. There are dozens of others — Maxim's 2026 top-5 list is a useful enumeration — but these four cover roughly 85% of the conversation.

The category split, in one sentence: gateway tools (Helicone) vs SDK-instrumentation tools (LangSmith, Langfuse, Phoenix). Gateway tools sit in front of the provider as a proxy and observe traffic. SDK tools sit in your code and observe spans. Different architectures, different trade-offs.
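
The difference is where the instrumentation lives. A minimal sketch of both styles — the gateway URL and the span helper are illustrative placeholders, not any vendor's actual API:

```python
import contextlib
import time

# Gateway style: integration is a base-URL flip. All traffic flows through
# the proxy, which observes and logs it. (Placeholder URL, not a real gateway.)
gateway_config = {
    "base_url": "https://gateway.example.com/v1",  # was https://api.openai.com/v1
    "api_key": "sk-...",
}

# SDK style: instrumentation lives in your code. You wrap units of work in
# spans and the SDK exports them to the observability backend.
@contextlib.contextmanager
def traced_span(name, sink):
    start = time.monotonic()
    try:
        yield
    finally:
        sink.append({"name": name, "duration_s": time.monotonic() - start})

spans = []
with traced_span("llm_call", spans):
    pass  # the actual provider call would go here

print(spans[0]["name"])  # llm_call
```

The gateway sees every request but only what crosses the wire; the SDK sees your code's structure (which tool call belonged to which agent step) but only where you remembered to instrument.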

The four tools, with prices

LangSmith

The native observability product from the LangChain team. Best-in-class if you live in LangChain.

Price:

| Tier | Price | Included | Note |
| --- | --- | --- | --- |
| Developer | $0 | 5k traces/mo | Free |
| Plus | $39/seat/mo | 10k traces/seat | Per-trace overage |
| Enterprise | Custom | Custom | SSO, on-prem |

The trap: "Trace" is loosely defined and has shifted across releases. A long agent run can produce 50–200 spans that count as separate billable units in some configurations. Helicone's comparison walks the math; Confident AI's alternatives and Mirascope's roundup are the honest external reads.
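
To make the span-counting trap concrete, here's the arithmetic. The per-unit price is a hypothetical overage rate chosen for illustration, not LangSmith's actual price:

```python
# One workload, two ways of counting the billable units.
runs_per_month = 10_000
spans_per_run = 100        # a long agent run can emit 50-200 spans
price_per_unit = 0.0005    # hypothetical overage price per billable unit

billed_as_traces = runs_per_month * price_per_unit                 # run = 1 unit
billed_as_spans = runs_per_month * spans_per_run * price_per_unit  # span = 1 unit

print(billed_as_traces)  # 5.0
print(billed_as_spans)   # 500.0 -- 100x the bill from the same workload
```

Whether your run counts as one unit or a hundred is exactly the "loosely defined" part, and it's the whole difference between a rounding error and a line item.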

What it gets right: Zero-config tracing inside LangChain. Eval primitives that match Runnable. The Hub for prompts.

What it gets wrong: Per-trace pricing turns exponential at scale. Framework lock-in to LangChain — you can use LangSmith without LangChain but the ergonomic value drops sharply. US-resident by default; EU available but not default. No budget enforcement, only observation. See the LangSmith alternatives breakdown.

Helicone

The proxy-first, base-URL-flip option. Fastest setup in the category.

Price:

| Tier | Price | Included | Note |
| --- | --- | --- | --- |
| Free | $0 | 10k requests | Free |
| Pro | $20/seat/mo | $200/mo cap | Cap is the unusual feature |
| Enterprise | Custom | Custom | Self-host gateway available |

What it gets right: Five-minute setup. Caching as a first-class feature (proxy architecture lets it cache transparently). Reasonable Pro tier with a usage cap — most competitors don't cap. Honest competitor guide.
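
The caching point is worth seeing in code. Because a proxy sees the full request before the provider does, it can answer repeats from a local store with zero application changes. A minimal in-memory sketch — a real gateway would use Redis or similar:

```python
import hashlib
import json

class CachingProxy:
    """Transparent response cache keyed on the canonicalized request body."""
    def __init__(self, upstream):
        self.upstream = upstream  # callable that actually hits the provider
        self.cache = {}
        self.hits = 0

    def complete(self, request):
        key = hashlib.sha256(
            json.dumps(request, sort_keys=True).encode()
        ).hexdigest()
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        response = self.upstream(request)
        self.cache[key] = response
        return response

proxy = CachingProxy(upstream=lambda req: {"text": "hello"})
proxy.complete({"model": "gpt-4o", "prompt": "hi"})
proxy.complete({"model": "gpt-4o", "prompt": "hi"})  # identical -> cache hit
print(proxy.hits)  # 1
```

An SDK-side tool can't do this transparently: it observes the call but doesn't sit in the request path, so it can't short-circuit it.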

What it gets wrong: Eval tooling is the weakest in the category — Soufian called it an "afterthought." Budget control is observation, not enforcement. The proxy hop adds 10–30ms per request. Limited multi-provider routing visibility. See the Helicone alternatives breakdown.

Langfuse

The OSS observability tool that actually works.

Price:

| Tier | Price | Included | Note |
| --- | --- | --- | --- |
| Hobby (Cloud) | $0 | 50k events | Free |
| Cloud Pro | $50/mo | 100k events | + per-event overage |
| Cloud Team | $199/mo | 1M events | SSO, longer retention |
| Self-host | $0 (subscription) | Unlimited | Apache-2.0 |

What it gets right: Apache-2.0 license, no asterisks. Best self-host story in the category — the docs cover Postgres, ClickHouse, S3, OAuth. Strong eval tooling, competitive with LangSmith. Cost transparency. The ZenML comparison is honest.

What it gets wrong: Self-host means three databases (Postgres, ClickHouse, Redis) for your SRE team to operate. MCP support is partial as of early 2026. No budget enforcement, only observation. No on-host PII redaction. See the Langfuse alternatives breakdown.

Arize Phoenix

The OSS tool from Arize AI's team. OpenInference-native.

Price: Free (OSS). Arize sells a closed cloud (AX) on top, custom pricing.

What it gets right: OpenInference is the cleanest OTel-aligned semantic convention for LLMs. Strong tracing primitives. Notebook-first ergonomics for evaluation. Free.

What it gets wrong: Production tooling is sparse — alerting, multi-tenant access control, retention, and operational stability all lag the others. Most teams who try Phoenix end up using it for local dev and pairing it with a managed tool for production. The Arize cloud (AX) closes the gap but is enterprise-priced.

Side-by-side

| Dimension | LangSmith | Helicone | Langfuse | Phoenix | Sutrace |
| --- | --- | --- | --- | --- | --- |
| Setup | LangChain native | Base URL flip | SDK or OTel | Notebook-first | OTel collector |
| License | Closed cloud | Closed cloud, OSS gateway | Apache-2.0 + closed cloud | Apache-2.0 | Closed cloud, OSS SDK |
| Pricing | Per-seat + per-trace | Per-seat, $200 cap | Per-event tiers | Free / custom AX | Per-GB ingest + per-seat |
| Paid tier | $39/seat | $20/seat | $50/mo | Free | Flat ingest |
| Self-host | Enterprise on-prem | OSS gateway | Yes (good) | Yes | No |
| EU residency | Available | US default | Cloud EU available | Self-host wherever | europe-west3 default |
| Eval depth | Strong | Weak | Strong | Strong (notebook) | Strong |
| Budget enforcement | Observe only | Observe only | Observe only | Observe only | Synchronous interlock |
| On-host redaction | No | No | No | No | Yes |
| Prompt-injection signals | No | No | No | No | Yes |
| MCP tracing | Partial | No | Partial | No | Native |
| Multi-provider routing | Limited | Limited | Custom | Custom | Native |
| Hardware/SCADA | No | No | No | No | Yes |

Real pricing math at three scales

Let's stop talking in tiers and do the actual numbers.

Scenario A: Solo dev / prototype (10k requests/mo, 1 seat)

| Tool | Monthly cost |
| --- | --- |
| LangSmith | $0 (Developer tier) |
| Helicone | $0 (Free tier) |
| Langfuse Cloud | $0 (Hobby) |
| Phoenix | $0 (OSS) |
| Sutrace | $0 (Free tier) |

All free. Pick on ergonomics.

Scenario B: Small team in production (500k requests/mo, 5 seats, 30-span runs)

| Tool | Monthly cost |
| --- | --- |
| LangSmith | ~$195 in seats + per-trace overage (10k traces/seat included) ≈ $250–$400 |
| Helicone | $100 (capped at $200/mo regardless) |
| Langfuse Cloud Pro | $50 + overage ≈ $80–$150 |
| Phoenix | $0 (OSS; your hosting cost ≈ $50–$200 for Postgres + ClickHouse) |
| Sutrace | ~$120 (ingest tier) |

Helicone is cheapest if the cap holds. Sutrace and Langfuse are roughly comparable. LangSmith bites first at this scale.

Scenario C: Production agent platform (10M requests/mo, 20 seats, 50-span runs)

| Tool | Monthly cost |
| --- | --- |
| LangSmith Plus | $780 in seats + per-trace ≈ $2,500–$5,000 |
| Helicone | Custom (well above the $200 cap) ≈ $1,500–$3,000 |
| Langfuse Cloud Team | $199 + overage ≈ $600–$1,200 |
| Langfuse self-host | Infrastructure ≈ $400–$1,500 + SRE time |
| Phoenix self-host | Infrastructure ≈ $400–$1,500 + SRE time |
| Sutrace | Ingest tier ≈ $400–$900 |

The crossover where Sutrace and Langfuse pull clearly ahead is around 1–3M requests/month. The crossover where self-host saves money over Cloud is around 5M, but only if you're not paying SRE time at $200/hr equivalent.
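
A toy model makes the SRE-time caveat explicit. All the numbers below are assumptions in the spirit of the scenarios above, not quotes:

```python
def cloud_cost(millions_of_requests, base=199, per_million=100):
    """Managed tier: flat base plus usage overage (assumed rates)."""
    return base + millions_of_requests * per_million

def self_host_cost(infra=700, sre_hours_per_month=5, sre_rate=200):
    """Self-host: infrastructure plus the SRE time you actually pay for."""
    return infra + sre_hours_per_month * sre_rate

# Ignoring SRE time, self-host wins just past 5M requests/month...
print(cloud_cost(5), self_host_cost(sre_hours_per_month=0))  # 699 700
# ...but a realistic 5 hrs/mo of SRE time at $200/hr triples the real cost
# of self-hosting and pushes the crossover far to the right.
print(cloud_cost(15), self_host_cost())                      # 1699 1700
```

The lesson isn't the exact numbers; it's that the SRE-time term dominates the infrastructure term, so "self-host is free" is only true if your ops time is.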

Framework lock-in: who locks you in

  • LangSmith. Soft lock-in to LangChain. The eval primitives, the Hub, the auto-tracing — all assume LangChain semantics. Migrating off requires rebuilding eval suites.
  • Helicone. Almost no lock-in. You change a base URL back, and you're out.
  • Langfuse. Modest lock-in via the SDK and trace data model. The OTel ingest path is clean if you instrument with OTel from day one.
  • Phoenix. OpenInference, which is OTel-aligned. Low lock-in.
  • Sutrace. OTel-native. Lock-in is minimal — point your collector elsewhere.
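
The common thread in the low-lock-in entries above is a single exporter seam between application code and backend — which is exactly what instrumenting with OTel from day one buys you. A sketch of that seam in plain Python (mimicking the idea, not the actual OpenTelemetry SDK classes):

```python
class ConsoleExporter:
    """Day-one backend: just collect spans locally."""
    def __init__(self):
        self.seen = []

    def export(self, span):
        self.seen.append(span)

class Tracer:
    """Application code talks only to the tracer, never to a vendor."""
    def __init__(self, exporter):
        self.exporter = exporter

    def span(self, name, **attrs):
        self.exporter.export({"name": name, **attrs})

# Switching backends later (Langfuse, Phoenix, a collector endpoint...) is
# one line here -- the instrumented application code never changes.
exporter = ConsoleExporter()
tracer = Tracer(exporter)
tracer.span("llm_call", model="gpt-4o")
print(exporter.seen[0]["name"])  # llm_call
```

SDK lock-in, when it happens, is the absence of this seam: instrumentation calls scattered through your code that only one vendor's backend understands.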

Self-hosting paths

  • Langfuse: First-class. Postgres + ClickHouse + Redis. Helm chart, Docker compose, Terraform modules. The clean choice if you need self-host.
  • Phoenix: OSS, runs as a Python service. Easier than Langfuse to start; harder to scale.
  • Helicone: OSS gateway. The full product (UI, billing, multi-tenant) is closed.
  • LangSmith: Enterprise on-prem only.
  • Sutrace: No self-host today. EU-resident managed.

Eval depth: the part that actually matters

Hamel Husain's eval FAQ — see our writeup — argues that prefab evals are the wrong primitive: "All you get from using these prefab evals is you don't know what they actually do…" The correct shape is custom annotation tools, written by your domain experts, against your specific data.

Given that, eval-tool quality is less about feature surface and more about one question: how easily can you define a custom evaluator function, version it, and run it against a held-out dataset on a schedule?

  • LangSmith: Best ergonomics for LangChain users. Custom evaluators are decorators on Python functions. Datasets in UI.
  • Langfuse: Same shape, slightly less polished UI, fully OSS underneath. Custom evaluators well-supported.
  • Phoenix: Notebook-first. Excellent for one-off analysis, weaker for scheduled eval runs.
  • Helicone: Surface is shallow. You'll outgrow it fast.
  • Sutrace: Custom evaluators, datasets, scheduled runs, regression tracking. Framework-agnostic.
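
The "correct shape" Hamel argues for is worth making concrete: a custom evaluator is just a function from (inputs, output) to a score, written against your own data. Nothing tool-specific about it — the evaluator and dataset below are made up for illustration:

```python
def cites_a_source(inputs, output, reference=None):
    """Domain-specific check written by a domain expert, not a prefab metric."""
    passed = "http" in output or "[source]" in output.lower()
    return {"key": "cites_a_source", "score": float(passed)}

dataset = [
    {"inputs": {"q": "latest pricing?"},
     "output": "See https://example.com/pricing"},
    {"inputs": {"q": "latest pricing?"},
     "output": "It's $20/seat."},
]

scores = [cites_a_source(row["inputs"], row["output"]) for row in dataset]
pass_rate = sum(s["score"] for s in scores) / len(scores)
print(pass_rate)  # 0.5
```

Every tool on the list can run something shaped like this; the differences are in how painless versioning, scheduling, and regression comparison are around it.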

The "gateway + eval" hybrid pattern

Here's the recurring pattern in real teams that I want to call out explicitly.

A team starts with Helicone because setup is fast. The proxy gives them traces and caching. Six months later, they need real evals — LLM-as-judge, datasets, regression tracking. Helicone's eval surface is too shallow. They add Langfuse or LangSmith. Now they're running two tools.

Two tools means two bills, two SDKs (or one SDK + a base URL), two dashboards, two access-control configurations. It works, but it's not stable — sooner or later the team consolidates on one side. The gateway-only side is rarely it, because gateways are commoditised. The eval-side wins because the depth is harder to replicate.

This is the gap Sutrace was built to close. We do the gateway visibility (multi-provider routing tags, request logging) and the eval depth (datasets, LLM-as-judge, regression) and add the two layers nobody else does — hard budget caps that fire synchronously, and on-host PII redaction. Plus prompt-injection detection, which is no longer optional given the named EchoLeak / CamoLeak CVEs that shipped in 2025–2026.

Where Sutrace fits

Honest answer: we don't replace Langfuse for self-host. If you have an SRE team and your buyer requires self-host, Langfuse is the right call. We don't replace LangSmith for LangChain-native teams who want zero-config tracing inside LCEL — the ergonomic value of native integration is real.

We do replace the "I'm running Helicone for the gateway and Langfuse for evals" hybrid with a single managed tool. We do replace LangSmith for teams whose bills are climbing on per-trace overage and who don't live in LangChain. We do replace Helicone for teams who've hit the wall on evals or need budget enforcement, not just observation.

The three things we do that none of the four do:

  1. Hard budget caps that fire synchronously. Not "alert me when spend exceeds X." Block the next provider call.
  2. On-host PII redaction. The redactor runs in your VPC. The original prompt never leaves.
  3. Prompt-injection signals as a span attribute. Every trace carries a detection score. With named CVEs shipping in MS Copilot and GitHub Copilot Chat in 2025–2026, this is no longer a niche feature.
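
To show what "fires synchronously" means in practice, here's a minimal interlock sketch — illustrative only, not Sutrace's actual API:

```python
class BudgetExceeded(RuntimeError):
    pass

class BudgetInterlock:
    """Checks spend *before* the provider call, in the request path --
    not an alert that arrives after the money is already gone."""
    def __init__(self, monthly_cap_usd):
        self.cap = monthly_cap_usd
        self.spent = 0.0

    def guard(self, estimated_cost_usd):
        if self.spent + estimated_cost_usd > self.cap:
            raise BudgetExceeded(f"cap ${self.cap:.2f} would be exceeded")
        self.spent += estimated_cost_usd

interlock = BudgetInterlock(monthly_cap_usd=100.0)
interlock.guard(60.0)        # within budget: the provider call proceeds
try:
    interlock.guard(50.0)    # would hit $110: blocked before the call is made
except BudgetExceeded:
    print("blocked")
```

The observe-only tools all emit the same information; the difference is that here the check sits in the request path and can refuse, rather than paging you after the fact.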

For the AI-agent picture see the use case. For the broader stack unification — hardware, software, web/APIs, agents — see the Datadog comparison.

Citations and further reading

If you've made it this far and still have questions — pricing, DPA, anything else — try the use case page and run an integration. We don't take sales calls until you've tried it.