AI agent observability — watch every agent, every token, every prompt under one dashboard

Hard budget caps, on-host prompt redaction, multi-provider routing visibility, and prompt-injection telemetry for teams shipping LLM agents in production. EU residency by default.



TL;DR. AI agents in production fail in three ways your existing observability stack cannot see: they leak data via prompt injection (EchoLeak CVE-2025-32711, CamoLeak CVE-2025-59145, both shipped in Microsoft and GitHub products this year), they burn cash in stuck loops (RelayPlane documented a single agent going from $0.80/run to $47/run overnight), and they silently re-route between providers in ways that break your eval baseline. Sutrace puts hard budget caps in front of every provider call, redacts prompts on-host before they leave your network, and emits OpenTelemetry GenAI spans tagged with the provider, model, and route the request actually took. One dashboard. EU residency by default (europe-west3). No per-agent pricing — flat ingest tier with cardinality tracked, not billed. If you're shipping agents and your only telemetry is the provider's invoice 30 days later, you are operating blind.

This page is the long version. It covers the three failure modes in detail, what we actually do about each, the architecture, and the questions teams ask before they switch.

The three things that go wrong with AI agents in production

1. Prompt injection is shipping in named products

This is no longer theoretical. In 2025–2026 the same vulnerability class — indirect prompt injection — was assigned CVE numbers in two of the most widely deployed AI products on earth.

EchoLeak (CVE-2025-32711, CVSS 9.3) is a zero-click data-exfiltration vulnerability in Microsoft 365 Copilot. An attacker sends an email containing a hidden prompt. Copilot, when later summarising the user's inbox, follows the instructions in that email and leaks Outlook content, OneDrive files, and Teams chat history to an attacker-controlled URL. No user click required. Disclosed by Aim Security; covered in detail by HackTheBox's writeup and Checkmarx.

CamoLeak (CVE-2025-59145, CVSS 9.6) is a similar pattern in GitHub Copilot Chat. The injection lives in pull-request descriptions — hidden HTML that the model reads as instructions when a developer asks Copilot to summarise the PR.

Tenable's 7-vuln disclosure (5 Nov 2025) extended the same class into ChatGPT, hitting both GPT-4o and GPT-5. The lead researcher's quote, repeated in The Hacker News coverage, is the line every CISO should print and pin:

"Prompt injection is a known issue with the way that LLMs work, and, unfortunately, it will probably not be fixed systematically in the near future."

If the fix isn't coming from the model vendors, you need detection and containment in your own stack. That's a telemetry problem.

2. Stuck loops eat cash overnight

The classic example, documented by RelayPlane in their 2026 runaway-cost analysis: a coding agent with a vague stop condition hit a tool-calling loop. Per-run cost went from $0.80 to $47. Same input, same model. The agent simply kept reasoning. By the time the developer noticed, the team had burned $14,000 in three days.

SupraWall's runaway-cost guide catalogues a dozen variants — recursion through retrieval, tool-call ping-pong, "let me just check that one more time" loops. The DEV community guide on stopping AI cost blowups is a useful read on detection patterns.

The provider-side spend caps (OpenAI usage limits, Anthropic monthly cap) are real but flawed: they apply at the organisation level, fire after the spend, and trip every other agent in your account at the same time. If your billing agent goes runaway, your customer-facing agent goes dark too. That's not containment. That's a circuit breaker that takes the whole house out.

3. Multi-provider routing breaks your eval baseline

If you use OpenRouter, AWS Bedrock, or any LLM gateway, your "GPT-4o" request may actually have been served by Azure OpenAI, OpenAI direct, or even a Together-hosted variant — depending on capacity and routing rules. OpenRouter exposes 400+ models across 60+ providers. The AWS Multi-Provider GenAI Gateway reference architecture exists precisely because teams need to fail over between providers without app changes.

Cool for resilience. Bad for evals. Your accuracy metrics regress and you cannot tell whether the model changed, the prompt changed, or the upstream provider got swapped under you. Without span-level provider tagging, you are debugging in the dark.

What Sutrace does about it

Hard budget caps that actually trip in time

We sit in front of every provider call as an opt-in proxy or via SDK middleware (Python and TypeScript). Each agent declares a budget — per-run, per-day, per-tenant — and the cap fires synchronously before the next provider request is sent. Not 30 minutes later. Not when the org's monthly cap is hit.

The cap interlocks at three levels:

  1. App-level token budgets — max_tokens and turn-count limits per agent
  2. Sutrace-level run budgets — dollar caps per run, per day, per tenant
  3. Provider-level fallback — last-resort cap from OpenAI/Anthropic, but you should never rely on it

Each level is a defence. The middle layer — Sutrace — is the one that fires fast enough to stop a $47 run before it becomes a $1,400 day. We wrote up the architecture options in Hard budget caps for AI agents — the architecture options.
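The middle layer can be sketched in a few lines. This is an illustrative toy, assuming a token-priced provider client — the class name, pricing parameter, and exception are ours, not the Sutrace SDK:

```python
# Toy sketch of a synchronous per-run budget cap wrapped around a provider
# client. All names here (BudgetCappedClient, BudgetExceeded) are
# illustrative, not the Sutrace SDK surface.
class BudgetExceeded(RuntimeError):
    pass

class BudgetCappedClient:
    """Blocks the *next* provider call once the running total crosses the cap."""

    def __init__(self, client, max_usd_per_run: float, usd_per_1k_tokens: float):
        self._client = client
        self._cap = max_usd_per_run
        self._rate = usd_per_1k_tokens
        self._spent = 0.0

    def complete(self, prompt: str, **kwargs):
        # Synchronous check BEFORE the request leaves — this is what makes
        # the cap trip in time, unlike provider-side monthly limits.
        if self._spent >= self._cap:
            raise BudgetExceeded(f"run budget ${self._cap:.2f} exhausted")
        response = self._client.complete(prompt, **kwargs)
        # Reconcile after each call; an in-flight call can still overshoot
        # slightly, which is the architectural floor discussed below.
        tokens = response["usage"]["total_tokens"]
        self._spent += tokens / 1000 * self._rate
        return response
```

The point of the sketch: the cap is a pre-flight gate on the next call, not a post-hoc billing alert, so a stuck loop dies on its first over-budget iteration.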

On-host prompt redaction

Prompts contain customer data. Customer data should not leave your VPC. We ship a redactor that runs on the host, before the request leaves your network. PII patterns (emails, phone numbers, IBANs, US SSNs, EU national IDs), secret patterns (API keys in 80+ formats), and customer-defined regex — all stripped or hashed before the OTel span is exported.
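In shape, the redactor is simple: match, replace, then export. A toy sketch — these three patterns are illustrative stand-ins, not the shipped ruleset:

```python
import re

# Minimal on-host redaction sketch. The real ruleset covers far more
# patterns (EU national IDs, 80+ API-key formats, customer regex);
# these three are illustrative only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "IBAN":  re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(prompt: str) -> str:
    """Replace matched PII with placeholders before anything is exported."""
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt
```

Because this runs in your VPC before the provider request or the OTel span leaves, the unredacted original never exists outside your network.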

This is the difference between "we have audit logs of everything customers typed into our agent" (a GDPR liability) and "we have audit logs of patterns and counts" (a compliance asset). The second is what you want. See our DPA for the contractual side.

Provider routing visibility

Every span Sutrace emits carries the OpenTelemetry GenAI semantic-convention attributes:

  • gen_ai.system — the provider that served the request (openai, anthropic, aws.bedrock, azure.openai, openrouter)
  • gen_ai.request.model — what your code asked for
  • gen_ai.response.model — what the provider actually used (these diverge more than you'd expect)
  • gen_ai.usage.input_tokens / output_tokens
  • gen_ai.route.upstream — for gateway-routed traffic, the upstream provider behind the gateway

When your eval regresses, you can answer "did the model change?" in one query. Most teams cannot.
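With those attributes on every span, "did the upstream change?" is one group-by. A sketch over plain span dicts — the span shapes and values are illustrative, not the Sutrace query API:

```python
from collections import Counter

# Three spans that all *requested* gpt-4o. Attribute keys follow the OTel
# GenAI semantic conventions described above; the span dicts themselves
# are illustrative stand-ins for exported span data.
spans = [
    {"gen_ai.request.model": "gpt-4o", "gen_ai.response.model": "gpt-4o-2024-08-06",
     "gen_ai.route.upstream": "azure.openai"},
    {"gen_ai.request.model": "gpt-4o", "gen_ai.response.model": "gpt-4o-2024-05-13",
     "gen_ai.route.upstream": "openai"},
    {"gen_ai.request.model": "gpt-4o", "gen_ai.response.model": "gpt-4o-2024-08-06",
     "gen_ai.route.upstream": "azure.openai"},
]

# Group by what actually served the request, not what the code asked for.
served_by = Counter(
    (s["gen_ai.response.model"], s["gen_ai.route.upstream"]) for s in spans
)
# Two distinct (response model, upstream) pairs behind one requested model —
# exactly the divergence that silently breaks an eval baseline.
```

Without the response-model and upstream tags, those three spans are indistinguishable and the regression looks like noise.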

Prompt-injection signals

We score every prompt against a set of detector patterns — instruction overrides, role-confusion, hidden HTML/Markdown, base64 payloads, classic jailbreak chains. The score becomes a span attribute (sutrace.injection.score) and an alert rule. You don't have to act on it, but you can see it. NeuralTrust's stack guide is a useful primer if you want to roll your own; we built it because most teams won't.
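The scoring idea fits in a screenful. A hedged sketch — these four detectors and weights are toy examples, not the production pattern set:

```python
import re

# Illustrative detector patterns with weights; the shipped set is larger
# and tuned. The capped sum becomes the span attribute described above.
DETECTORS = [
    (re.compile(r"ignore (all )?(previous|prior) instructions", re.I), 0.6),  # instruction override
    (re.compile(r"you are now", re.I), 0.3),                                  # role confusion
    (re.compile(r"<[^>]+style\s*=\s*[\"'][^\"']*display\s*:\s*none", re.I), 0.5),  # hidden HTML
    (re.compile(r"[A-Za-z0-9+/]{80,}={0,2}"), 0.4),                           # long base64-ish payload
]

def injection_score(prompt: str) -> float:
    """Sum matched detector weights, capped at 1.0."""
    return min(1.0, sum(w for rx, w in DETECTORS if rx.search(prompt)))
```

A score is attached, not enforced: the span carries it, your alert rule decides what to do with it.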

How Sutrace compares to dedicated LLM observability tools

We're not the only entrant. The honest read on the alternatives:

  • LangSmith — best-in-class if you live in LangChain. Per-trace pricing climbs steeply once agents get chatty. $39/seat Plus is just the starting point.
  • Helicone — fastest setup, weakest evals. $20/seat with a $200/mo cap. Proxy-first.
  • Langfuse — Apache-2.0, the best self-host story. $50/mo cloud starter. MCP gap.

We unify them with the rest of your stack. If you also run hardware (SCADA/PLC) or have a Datadog bill that's eating you alive (see our Datadog comparison), the same dashboard handles all of it.

Frequently asked questions

How does Sutrace compare to LangSmith for production agents?

LangSmith is excellent for LangChain-native workflows. If your agents don't live in LangChain, you'll fight it. Sutrace is framework-agnostic — OTel-first, with SDKs that wrap the OpenAI / Anthropic / Bedrock clients directly. We also bill on ingest, not per-trace, so a chatty agent doesn't blow up your bill. See the full LangSmith alternatives breakdown.

Can I self-host Sutrace?

Cloud-only today, EU-resident. If self-hosting is a hard requirement, Langfuse is the honest recommendation — Apache-2.0, well-documented self-host path. We may offer a self-hosted tier in 2026; talk to us if it's blocking.

What does on-host redaction actually mean?

The redactor is a Python/TS library or a sidecar process. It runs in your VPC, scans the prompt before any provider request leaves, and replaces matched patterns with placeholders. The original is never sent to Sutrace. We see the redacted version. Your audit log is the redacted version. This is the only architecture that survives a strict DPO review.

How fast does the budget cap fire?

Synchronously. The next provider call is blocked at the SDK or proxy layer the moment the running total crosses the threshold. There's no batch reconciliation. The trade-off: a single in-flight call already sent cannot be cancelled — provider APIs don't expose cancellation. So a cap set at $50 might be tripped at $50.01 if a long completion was already streaming. This is the architectural floor; nobody can do better.

Does Sutrace work with multi-provider gateways like OpenRouter or AWS Bedrock?

Yes. We tag every span with both the gateway and the upstream provider. If your trace went app → OpenRouter → Anthropic, both legs are visible. We wrote this up in detail in the multi-provider routing observability post.

Do you support OpenTelemetry GenAI semantic conventions?

Yes — natively. We emit and ingest the standard gen_ai.* attributes. If you already have OTel instrumentation, point it at our endpoint. No SDK rewrite. No re-instrumentation. The tag space is the OTel one, not a proprietary Sutrace one.

How do you handle the Anthropic OpenClaw cutoff in April 2026?

The OpenClaw cutoff — Anthropic moving third-party agent tools off the flat $20–$200/mo subscription tier and onto per-token billing — broke a lot of teams' unit economics. Sutrace shows the per-token cost in real time and the budget cap fires regardless of whether you're on subscription or metered. The transition exposed how many teams were running with no telemetry at all.

What's the EU residency story?

Default. Ingest, processing, and storage all run in europe-west3 (Frankfurt). No US replication. SCC-bound DPA available before sign-up. See /legal/dpa. For German Verein, French SARL, and Dutch BV buyers this is usually the deciding factor.

Get started

Self-serve. Drop our SDK in front of your provider client (one line), set a budget cap, and the dashboard fills in within minutes. No sales call. Pricing here. EU-resident from day one.