all posts

ai agents

Hamel Husain was right — eval tooling is commodified, and that has implications for vendor selection

The 15 January 2026 update to Hamel Husain's eval FAQ argued prefab evals are the wrong primitive. Custom annotation tools are 10x faster. The implications for which LLM observability vendor you pick — and which you don't.

By Akshay Sarode· March 28, 2026· 11 min readllmai-agentsobservabilityevals

Hamel Husain was right — eval tooling is commodified, and that has implications for vendor selection

TL;DR. Hamel Husain updated his eval FAQ on 15 January 2026 with a quote that should make every LLM-observability vendor uncomfortable: "All you get from using these prefab evals is you don't know what they actually do…" His position — built on years of working with teams shipping production LLMs — is that prefab evals (the LLM-as-judge templates, the canned datasets, the vendor-specific evaluator libraries) are the wrong primitive. Custom annotation tools, written by your domain experts against your specific data, are 10× faster to iterate on and produce signal you actually trust. If that's true — and the field experience says it is — then the "eval features" in observability vendors are not a defensible reason to pick one over another. The defensible reasons are the layers around evals: cost telemetry, budget enforcement, multi-provider routing visibility, prompt-injection detection, and the tracing primitives that let you build the custom eval you actually need. This post is the long-form thesis: why Hamel's position is right, what it means for vendor selection, and where it leaves the LangSmith / Helicone / Langfuse / Phoenix comparison. Read the eval FAQ in full first. The argument here builds on his.

What Hamel actually said

The full quote, from the FAQ updated 15 January 2026:

"All you get from using these prefab evals is you don't know what they actually do. The evaluator is doing some opaque LLM-as-judge thing under the hood. You can't reproduce it, you can't debug it, you can't tell when it's wrong. Your domain experts can write a custom annotation tool in two days. That's a real eval."

The thrust: an eval is only useful if you trust the score. The score is only trustworthy if you can reproduce it, audit it, and explain it to a non-technical stakeholder. Prefab LLM-as-judge evaluators — the shape of "GPT-4 grades GPT-4's output against a generic rubric" — fail all three tests.

Hamel's alternative is custom annotation tools: small interfaces that let domain experts label outputs against your specific criteria. The labels become your dataset. The dataset drives whatever automated eval you eventually build. The automated eval runs are checked against the human labels, not against another LLM's opinion.

This is not a new argument — Anthropic's own evaluation guides have said similar things, and the academic literature on LLM-as-judge has documented its failure modes for two years. But Hamel's framing as "you don't know what they actually do" is the most concise statement I've seen of why the prefab path doesn't work in practice.

Why this is the right take

Three reasons.

1. LLM-as-judge has known failure modes that prefab evaluators don't surface. Position bias (preferring the first option), length bias (preferring longer responses), self-preference bias (preferring outputs from the same model family). Academic work has documented these consistently. A prefab evaluator that uses GPT-4 to judge GPT-5 output is going to be noisy in ways that depend on which specific failure modes are active for your task. You can't tell from the dashboard.

2. Generic rubrics don't catch domain-specific failure. The eval prompt that ships in your vendor's library is "is this response helpful and accurate?" Your customer's actual definition of "accurate" is "does it cite the right policy section number from our 200-page handbook?" The prefab evaluator gives the wrong response a passing grade because it sounds plausible. A custom evaluator that checks for the policy section number gives it a failing grade.

3. Custom annotation is genuinely fast. Hamel's claim is "two days." The teams I've watched build custom annotation tools confirm this — a Streamlit page, a Postgres table, an export-to-JSONL button is not a project. The vendor's prefab eval feature is a project to evaluate, integrate, learn the DSL of, and trust. Net time to a working eval pipeline is longer with the prefab path.

What this means for vendor selection

If eval features are not a defensible reason to pick a vendor, what is?

I think the answer is: the layers around evals. Specifically, five things.

1. Tracing primitives that let you build custom evals

Your custom eval needs raw spans — input, output, intermediate steps, retrieval results, tool calls. The vendor that gives you a clean tracing primitive (OTel-aligned, queryable, exportable) is the one your custom eval can be built on. The vendor that gives you a beautiful prefab eval but locks the underlying spans behind a UI-only abstraction is the one you'll fight.

Sutrace, Langfuse, and Phoenix are good here. LangSmith is good if you live in LangChain. Helicone is shallowest.

2. Cost telemetry per call, per tenant, per feature

This is observability, not eval, but it's load-bearing. If you can't tell what each agent run costs at the per-tenant level, you can't price your product, you can't enforce budget caps, and you can't decide which features to keep on the expensive model. See the OpenClaw post for why this matters now.

3. Budget enforcement, not just observation

Same point as the hard budget caps post: if your "eval" workflow includes a stuck-loop case, the vendor that stops the loop is the one that's actually saving you money. Vendors that observe spend without intervention are postmortem tools.

4. Multi-provider routing visibility

If you route through OpenRouter, Bedrock, or any gateway, your eval baseline drifts whenever the upstream provider changes. The vendor that tags each span with the actual upstream provider lets you cross-tab eval regression by provider. The vendor that doesn't, leaves you guessing. See the multi-provider routing post.

5. Prompt-injection detection

The named CVEs are shipping (EchoLeak / CamoLeak / Tenable's GPT-5 chain). Detection in your telemetry is the practical defence. Vendors that don't ship this in 2026 are behind.

Where this leaves the four leaders

Re-evaluating LangSmith, Helicone, Langfuse, and Phoenix with Hamel's frame:

LangSmith. Eval features are best-in-class within LangChain. But if eval features aren't the deciding factor, the LangSmith justification is "you live in LangChain" — which is real but specific. For teams not on LangChain, the per-trace pricing without the eval lock-in argument is harder to justify.

Helicone. Eval features are weakest in the category — Hamel's argument predicts this is OK because eval features aren't defensible anyway. The Helicone justification is gateway-and-cache, which is real and unrelated to evals.

Langfuse. Strong eval features and Apache-2.0. The argument for Langfuse holds — but mostly because of self-host and license, not because of the eval depth specifically. If you want managed and EU-resident, the eval features alone don't carry it.

Phoenix. Strong eval features for notebook work. Holds for notebook-first teams. Production tooling around it is sparse.

The honest reframing: pick your observability vendor on the layers around evals, not the eval features themselves. Build your custom evaluators on top of whichever vendor gives you the cleanest tracing primitive and the strongest support layers (cost telemetry, budget enforcement, multi-provider routing, prompt-injection detection).

What custom annotation actually looks like

For readers who haven't built one, the shape is simple. Here's the minimum viable annotation tool:

  1. A queue of recent agent runs, sourced from your observability vendor's trace API. Filter by date, tenant, agent, feature, whatever.
  2. A view of one run at a time, showing the input, output, key intermediate steps. Pretty-print the output.
  3. A label form with your domain-specific criteria. "Did the response cite the correct policy section? (yes/no/n.a.)" "Did the response avoid hallucinating numbers? (yes/no)" Maybe a free-text comment field.
  4. A submit button. The label gets stored in your own database with a reference back to the trace ID.
  5. An export-to-JSONL button. Your dataset.

Streamlit, Gradio, or a 300-line Next.js page can do this. The hard work is the criteria, not the tool.

Once you have 200–500 labelled examples, you can:

  • Use them as a held-out test set for prompt iteration. Every prompt change runs the agent against the test set, and you measure how many of the human-labelled criteria pass.
  • Train an automated evaluator that's calibrated to your humans. The automated evaluator runs at every commit; the humans label new examples weekly to keep the calibration honest.
  • Build regression dashboards — prompt versions on the X-axis, pass-rate on the Y-axis.

This is the workflow Hamel argues for. It's the workflow most teams ship in production at scale. Vendor prefab evals are a faster start that becomes a slower system. Custom annotation is a slower start that becomes a faster system.

What Sutrace does

We don't try to compete on prefab evals. We give you:

  • Clean OTel-native tracing — the raw spans you need to build custom evaluators
  • Trace export to your annotation tool, in JSONL or via API
  • Datasets and scheduled runs — when you've built a custom evaluator, we run it on a schedule against held-out data
  • Regression tracking — pass-rate over time, per-prompt-version

The eval surface area is intentionally small because Hamel is right — the prefab eval features are not defensible. We invest in the layers around evals: budget enforcement, on-host PII redaction, prompt-injection signals, multi-provider routing visibility. See the AI agent observability use case for the full picture.

The honest take for vendor selection

If you're choosing an LLM-observability vendor in 2026, run this test:

  1. Strip the eval features from your evaluation. Pretend they don't exist. What's left to differentiate vendors?
  2. Score the remaining layers. Tracing depth. Cost telemetry. Budget enforcement. Multi-provider routing. Prompt-injection detection. Self-host story. EU residency. Pricing model.
  3. Pick on those. Use whatever eval primitives the vendor ships as a starting point, knowing you'll outgrow them and write custom annotators within a quarter.

If you do the test honestly, the vendor mix shifts. LangSmith without the LangChain native integration becomes harder to justify. Langfuse with Apache-2.0 self-host wins for license-sensitive buyers. Helicone with $20/seat wins for cost-sensitive starters. Sutrace wins for teams who want the budget-enforcement and prompt-injection layers that nobody else ships.

This isn't a sales argument. It's the framing Hamel's argument forces if you take it seriously. We benefit from teams taking it seriously because our differentiation is in the layers around evals, not in the eval features themselves. But you should run the test honestly regardless of which vendor it lands on.

Citations and further reading

For the rest of the cluster see Hard budget caps for AI agents, the EchoLeak / CamoLeak post, and the 4-way honest comparison.

If you've been picking observability vendors on eval features alone, the next time your eval suite stalls on a prefab judge that gave a hallucinated response a passing grade, remember Hamel's quote. You don't know what they actually do. That's the real risk you're carrying.