Helicone vs LangSmith vs Langfuse vs Phoenix — what each one actually gets wrong

A 4-way honest comparison of the leading LLM observability tools, the gateway-plus-eval hybrid pattern that emerged, and where Sutrace fits.

By Akshay Sarode · February 18, 2026 · 14 min read · llm · ai-agents · observability · langsmith

TL;DR. There are four serious LLM-observability tools shipping in 2026: LangSmith (best for LangChain teams, $39/seat Plus + per-trace overage), Helicone (fastest setup, $20/seat with $200/mo cap, weakest evals), Langfuse (Apache-2.0, the genuinely good self-host story, $50/mo Cloud Pro starter, MCP gap), and Arize Phoenix (OSS, OpenInference-native, sparse production tooling). All four have real gaps. The pattern teams keep landing on: a gateway (Helicone-style) for caching and routing, plus an eval-first tool (Langfuse or LangSmith) for the depth — two bills, two setups. This post is the honest tour: real pricing math, framework lock-in, self-hosting, eval depth, side-by-side, and where the consolidation goes from here. The triggering read is Soufian Azzaoui's DEV writeup of trying all four — the most honest field report I've seen. This is mine, with prices.

The category, briefly

LLM observability is a two-year-old category that's already crowded. The shape: a tool that records traces of agent runs, attributes cost, scores outputs, and helps you debug regressions. The four leaders by GitHub stars, customer adoption, and community depth: LangSmith, Helicone, Langfuse, Phoenix. There are dozens of others — Maxim's 2026 top-5 list is a useful enumeration — but these four cover roughly 85% of the conversation.

The category split, in one sentence: gateway tools (Helicone) vs SDK-instrumentation tools (LangSmith, Langfuse, Phoenix). Gateway tools sit in front of the provider as a proxy and observe traffic. SDK tools sit in your code and observe spans. Different architectures, different trade-offs.
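
The difference is where the instrumentation lives. A minimal sketch of both styles — the gateway URL and the span helper are illustrative placeholders, not any vendor's actual API:

```python
import contextlib
import time

# Gateway style: integration is a base-URL flip. All traffic flows through
# the proxy, which observes and logs it. (Placeholder URL, not a real gateway.)
gateway_config = {
    "base_url": "https://gateway.example.com/v1",  # was https://api.openai.com/v1
    "api_key": "sk-...",
}

# SDK style: instrumentation lives in your code. You wrap units of work in
# spans and the SDK exports them to the observability backend.
@contextlib.contextmanager
def traced_span(name, sink):
    start = time.monotonic()
    try:
        yield
    finally:
        sink.append({"name": name, "duration_s": time.monotonic() - start})

spans = []
with traced_span("llm_call", spans):
    pass  # the actual provider call would go here

print(spans[0]["name"])  # llm_call
```

The gateway sees every request but only what crosses the wire; the SDK sees your code's structure (which tool call belonged to which agent step) but only where you remembered to instrument.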

The four tools, with prices

LangSmith

The native observability product from the LangChain team. Best-in-class if you live in LangChain.

Price:

| Tier | Price | Included | Note |
| --- | --- | --- | --- |
| Developer | $0 | 5k traces/mo | Free |
| Plus | $39/seat/mo | 10k traces/seat | Per-trace overage |
| Enterprise | Custom | Custom | SSO, on-prem |

The trap: "Trace" is loosely defined and has shifted across releases. A long agent run can produce 50–200 spans that count as separate billable units in some configurations. Helicone's comparison walks the math; Confident AI's alternatives and Mirascope's roundup are the honest external reads.
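
To make the span-counting trap concrete, here's the arithmetic. The per-unit price is a hypothetical overage rate chosen for illustration, not LangSmith's actual price:

```python
# One workload, two ways of counting the billable units.
runs_per_month = 10_000
spans_per_run = 100        # a long agent run can emit 50-200 spans
price_per_unit = 0.0005    # hypothetical overage price per billable unit

billed_as_traces = runs_per_month * price_per_unit                 # run = 1 unit
billed_as_spans = runs_per_month * spans_per_run * price_per_unit  # span = 1 unit

print(billed_as_traces)  # 5.0
print(billed_as_spans)   # 500.0 -- 100x the bill from the same workload
```

Whether your run counts as one unit or a hundred is exactly the "loosely defined" part, and it's the whole difference between a rounding error and a line item.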

What it gets right: Zero-config tracing inside LangChain. Eval primitives that match Runnable. The Hub for prompts.

What it gets wrong: Per-trace pricing turns exponential at scale. Framework lock-in to LangChain — you can use LangSmith without LangChain but the ergonomic value drops sharply. US-resident by default; EU available but not default. No budget enforcement, only observation. See the LangSmith alternatives breakdown.

Helicone

The proxy-first, base-URL-flip option. Fastest setup in the category.

Price:

| Tier | Price | Included | Note |
| --- | --- | --- | --- |
| Free | $0 | 10k requests | Free |
| Pro | $20/seat/mo | $200/mo cap | Cap is the unusual feature |
| Enterprise | Custom | Custom | Self-host gateway available |

What it gets right: Five-minute setup. Caching as a first-class feature (proxy architecture lets it cache transparently). Reasonable Pro tier with a usage cap — most competitors don't cap. Honest competitor guide.
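
The caching point is worth seeing in code. Because a proxy sees the full request before the provider does, it can answer repeats from a local store with zero application changes. A minimal in-memory sketch — a real gateway would use Redis or similar:

```python
import hashlib
import json

class CachingProxy:
    """Transparent response cache keyed on the canonicalized request body."""
    def __init__(self, upstream):
        self.upstream = upstream  # callable that actually hits the provider
        self.cache = {}
        self.hits = 0

    def complete(self, request):
        key = hashlib.sha256(
            json.dumps(request, sort_keys=True).encode()
        ).hexdigest()
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        response = self.upstream(request)
        self.cache[key] = response
        return response

proxy = CachingProxy(upstream=lambda req: {"text": "hello"})
proxy.complete({"model": "gpt-4o", "prompt": "hi"})
proxy.complete({"model": "gpt-4o", "prompt": "hi"})  # identical -> cache hit
print(proxy.hits)  # 1
```

An SDK-side tool can't do this transparently: it observes the call but doesn't sit in the request path, so it can't short-circuit it.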

What it gets wrong: Eval tooling is the weakest in the category — Soufian called it an "afterthought." Budget control is observation, not enforcement. The proxy hop adds 10–30ms per request. Limited multi-provider routing visibility. See the Helicone alternatives breakdown.

Langfuse

The OSS observability tool that actually works.

Price:

| Tier | Price | Included | Note |
| --- | --- | --- | --- |
| Hobby (Cloud) | $0 | 50k events | Free |
| Cloud Pro | $50/mo | 100k events | + per-event overage |
| Cloud Team | $199/mo | 1M events | SSO, longer retention |
| Self-host | $0 (subscription) | Unlimited | Apache-2.0 |

What it gets right: Apache-2.0 license, no asterisks. Best self-host story in the category — the docs cover Postgres, ClickHouse, S3, OAuth. Strong eval tooling, competitive with LangSmith. Cost transparency. The ZenML comparison is honest.

What it gets wrong: Self-host means three databases (Postgres, ClickHouse, Redis) for your SRE team to operate. MCP support is partial as of early 2026. No budget enforcement, only observation. No on-host PII redaction. See the Langfuse alternatives breakdown.

Arize Phoenix

The OSS tool from Arize AI's team. OpenInference-native.

Price: Free (OSS). Arize sells a closed cloud (AX) on top, custom pricing.

What it gets right: OpenInference is the cleanest OTel-aligned semantic convention for LLMs. Strong tracing primitives. Notebook-first ergonomics for evaluation. Free.

What it gets wrong: Production tooling is sparse — alerting, multi-tenant access control, retention, and operational stability all lag the others. Most teams who try Phoenix end up using it for local dev and pairing it with a managed tool for production. The Arize cloud (AX) closes the gap but is enterprise-priced.

Side-by-side

| Dimension | LangSmith | Helicone | Langfuse | Phoenix | Sutrace |
| --- | --- | --- | --- | --- | --- |
| Setup | LangChain native | Base URL flip | SDK or OTel | Notebook-first | OTel collector |
| License | Closed cloud | Closed cloud, OSS gateway | Apache-2.0 + closed cloud | Apache-2.0 | Closed cloud, OSS SDK |
| Pricing | Per-seat + per-trace | Per-seat, $200 cap | Per-event tiers | Free / custom AX | Per-GB ingest + per-seat |
| Paid tier | $39/seat | $20/seat | $50/mo | Free | Flat ingest |
| Self-host | Enterprise on-prem | OSS gateway | Yes (good) | Yes | No |
| EU residency | Available | US default | Cloud EU available | Self-host wherever | europe-west3 default |
| Eval depth | Strong | Weak | Strong | Strong (notebook) | Strong |
| Budget enforcement | Observe only | Observe only | Observe only | Observe only | Synchronous interlock |
| On-host redaction | No | No | No | No | Yes |
| Prompt-injection signals | No | No | No | No | Yes |
| MCP tracing | Partial | No | Partial | No | Native |
| Multi-provider routing | Limited | Limited | Custom | Custom | Native |
| Hardware/SCADA | No | No | No | No | Yes |

Real pricing math at three scales

Let's stop talking in tiers and do the actual numbers.

Scenario A: Solo dev / prototype (10k requests/mo, 1 seat)

| Tool | Monthly cost |
| --- | --- |
| LangSmith | $0 (Developer tier) |
| Helicone | $0 (Free tier) |
| Langfuse Cloud | $0 (Hobby) |
| Phoenix | $0 (OSS) |
| Sutrace | $0 (Free tier) |

All free. Pick on ergonomics.

Scenario B: Small team in production (500k requests/mo, 5 seats, 30-span runs)

| Tool | Monthly cost |
| --- | --- |
| LangSmith | ~$195 in seats + per-trace overage (10k traces/seat included) ≈ $250–$400 |
| Helicone | $100 (capped at $200/mo regardless) |
| Langfuse Cloud Pro | $50 + overage ≈ $80–$150 |
| Phoenix | $0 (OSS; your hosting cost ≈ $50–$200 for Postgres + ClickHouse) |
| Sutrace | ~$120 (ingest tier) |

Helicone is cheapest if the cap holds. Sutrace and Langfuse are roughly comparable. LangSmith bites first at this scale.

Scenario C: Production agent platform (10M requests/mo, 20 seats, 50-span runs)

| Tool | Monthly cost |
| --- | --- |
| LangSmith Plus | $780 in seats + per-trace ≈ $2,500–$5,000 |
| Helicone | Custom (well above the $200 cap) ≈ $1,500–$3,000 |
| Langfuse Cloud Team | $199 + overage ≈ $600–$1,200 |
| Langfuse self-host | Infrastructure ≈ $400–$1,500 + SRE time |
| Phoenix self-host | Infrastructure ≈ $400–$1,500 + SRE time |
| Sutrace | Ingest tier ≈ $400–$900 |

The crossover where Sutrace and Langfuse pull clearly ahead is around 1–3M requests/month. The crossover where self-host saves money over Cloud is around 5M, but only if you're not paying SRE time at $200/hr equivalent.
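
A toy model makes the SRE-time caveat explicit. All the numbers below are assumptions in the spirit of the scenarios above, not quotes:

```python
def cloud_cost(millions_of_requests, base=199, per_million=100):
    """Managed tier: flat base plus usage overage (assumed rates)."""
    return base + millions_of_requests * per_million

def self_host_cost(infra=700, sre_hours_per_month=5, sre_rate=200):
    """Self-host: infrastructure plus the SRE time you actually pay for."""
    return infra + sre_hours_per_month * sre_rate

# Ignoring SRE time, self-host wins just past 5M requests/month...
print(cloud_cost(5), self_host_cost(sre_hours_per_month=0))  # 699 700
# ...but a realistic 5 hrs/mo of SRE time at $200/hr triples the real cost
# of self-hosting and pushes the crossover far to the right.
print(cloud_cost(15), self_host_cost())                      # 1699 1700
```

The lesson isn't the exact numbers; it's that the SRE-time term dominates the infrastructure term, so "self-host is free" is only true if your ops time is.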

Framework lock-in: who locks you in

  • LangSmith. Soft lock-in to LangChain. The eval primitives, the Hub, the auto-tracing — all assume LangChain semantics. Migrating off requires rebuilding eval suites.
  • Helicone. Almost no lock-in. You change a base URL back, and you're out.
  • Langfuse. Modest lock-in via the SDK and trace data model. The OTel ingest path is clean if you instrument with OTel from day one.
  • Phoenix. OpenInference, which is OTel-aligned. Low lock-in.
  • Sutrace. OTel-native. Lock-in is minimal — point your collector elsewhere.
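
The common thread in the low-lock-in entries above is a single exporter seam between application code and backend — which is exactly what instrumenting with OTel from day one buys you. A sketch of that seam in plain Python (mimicking the idea, not the actual OpenTelemetry SDK classes):

```python
class ConsoleExporter:
    """Day-one backend: just collect spans locally."""
    def __init__(self):
        self.seen = []

    def export(self, span):
        self.seen.append(span)

class Tracer:
    """Application code talks only to the tracer, never to a vendor."""
    def __init__(self, exporter):
        self.exporter = exporter

    def span(self, name, **attrs):
        self.exporter.export({"name": name, **attrs})

# Switching backends later (Langfuse, Phoenix, a collector endpoint...) is
# one line here -- the instrumented application code never changes.
exporter = ConsoleExporter()
tracer = Tracer(exporter)
tracer.span("llm_call", model="gpt-4o")
print(exporter.seen[0]["name"])  # llm_call
```

SDK lock-in, when it happens, is the absence of this seam: instrumentation calls scattered through your code that only one vendor's backend understands.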

Self-hosting paths

  • Langfuse: First-class. Postgres + ClickHouse + Redis. Helm chart, Docker compose, Terraform modules. The clean choice if you need self-host.
  • Phoenix: OSS, runs as a Python service. Easier than Langfuse to start; harder to scale.
  • Helicone: OSS gateway. The full product (UI, billing, multi-tenant) is closed.
  • LangSmith: Enterprise on-prem only.
  • Sutrace: No self-host today. EU-resident managed.

Eval depth: the part that actually matters

Hamel Husain's eval FAQ — see our writeup — argues that prefab evals are the wrong primitive: "All you get from using these prefab evals is you don't know what they actually do…" The correct shape is custom annotation tools, written by your domain experts, against your specific data.

Given that, eval-tool quality is less about feature surface and more about one question: how easily can you define a custom evaluator function, version it, and run it against a held-out dataset on a schedule?

  • LangSmith: Best ergonomics for LangChain users. Custom evaluators are decorators on Python functions. Datasets in UI.
  • Langfuse: Same shape, slightly less polished UI, fully OSS underneath. Custom evaluators well-supported.
  • Phoenix: Notebook-first. Excellent for one-off analysis, weaker for scheduled eval runs.
  • Helicone: Surface is shallow. You'll outgrow it fast.
  • Sutrace: Custom evaluators, datasets, scheduled runs, regression tracking. Framework-agnostic.
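
The "correct shape" Hamel argues for is worth making concrete: a custom evaluator is just a function from (inputs, output) to a score, written against your own data. Nothing tool-specific about it — the evaluator and dataset below are made up for illustration:

```python
def cites_a_source(inputs, output, reference=None):
    """Domain-specific check written by a domain expert, not a prefab metric."""
    passed = "http" in output or "[source]" in output.lower()
    return {"key": "cites_a_source", "score": float(passed)}

dataset = [
    {"inputs": {"q": "latest pricing?"},
     "output": "See https://example.com/pricing"},
    {"inputs": {"q": "latest pricing?"},
     "output": "It's $20/seat."},
]

scores = [cites_a_source(row["inputs"], row["output"]) for row in dataset]
pass_rate = sum(s["score"] for s in scores) / len(scores)
print(pass_rate)  # 0.5
```

Every tool on the list can run something shaped like this; the differences are in how painless versioning, scheduling, and regression comparison are around it.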

The "gateway + eval" hybrid pattern

Here's the recurring pattern in real teams that I want to call out explicitly.

A team starts with Helicone because setup is fast. The proxy gives them traces and caching. Six months later, they need real evals — LLM-as-judge, datasets, regression tracking. Helicone's eval surface is too shallow. They add Langfuse or LangSmith. Now they're running two tools.

Two tools means two bills, two SDKs (or one SDK + a base URL), two dashboards, two access-control configurations. It works, but it's not stable — sooner or later the team consolidates on one side. The gateway-only side is rarely it, because gateways are commoditised. The eval-side wins because the depth is harder to replicate.

This is the gap Sutrace was built to close. We do the gateway visibility (multi-provider routing tags, request logging) and the eval depth (datasets, LLM-as-judge, regression) and add the two layers nobody else does — hard budget caps that fire synchronously, and on-host PII redaction. Plus prompt-injection detection, which is no longer optional given the named EchoLeak / CamoLeak CVEs that shipped in 2025–2026.

Where Sutrace fits

Honest answer: we don't replace Langfuse for self-host. If you have an SRE team and your buyer requires self-host, Langfuse is the right call. We don't replace LangSmith for LangChain-native teams who want zero-config tracing inside LCEL — the ergonomic value of native integration is real.

We do replace the "I'm running Helicone for the gateway and Langfuse for evals" hybrid with a single managed tool. We do replace LangSmith for teams whose bills are climbing on per-trace overage and who don't live in LangChain. We do replace Helicone for teams who've hit the wall on evals or need budget enforcement, not just observation.

The three things we do that none of the four do:

  1. Hard budget caps that fire synchronously. Not "alert me when spend exceeds X." Block the next provider call.
  2. On-host PII redaction. The redactor runs in your VPC. The original prompt never leaves.
  3. Prompt-injection signals as a span attribute. Every trace carries a detection score. With named CVEs shipping in MS Copilot and GitHub Copilot Chat in 2025–2026, this is no longer a niche feature.
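
To show what "fires synchronously" means in practice, here's a minimal interlock sketch — illustrative only, not Sutrace's actual API:

```python
class BudgetExceeded(RuntimeError):
    pass

class BudgetInterlock:
    """Checks spend *before* the provider call, in the request path --
    not an alert that arrives after the money is already gone."""
    def __init__(self, monthly_cap_usd):
        self.cap = monthly_cap_usd
        self.spent = 0.0

    def guard(self, estimated_cost_usd):
        if self.spent + estimated_cost_usd > self.cap:
            raise BudgetExceeded(f"cap ${self.cap:.2f} would be exceeded")
        self.spent += estimated_cost_usd

interlock = BudgetInterlock(monthly_cap_usd=100.0)
interlock.guard(60.0)        # within budget: the provider call proceeds
try:
    interlock.guard(50.0)    # would hit $110: blocked before the call is made
except BudgetExceeded:
    print("blocked")
```

The observe-only tools all emit the same information; the difference is that here the check sits in the request path and can refuse, rather than paging you after the fact.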

For the AI-agent picture see the use case. For the broader stack unification — hardware, software, web/APIs, agents — see the Datadog comparison.

Citations and further reading

If you've made it this far and still have questions — pricing, DPA, anything else — try the use case page and run an integration. We don't take sales calls until you've tried it.