Hard budget caps for AI agents — the architecture options

From the RelayPlane $0.80→$47 stuck loop to the org-wide failure mode of provider spend caps — the four places you can put a budget cap, and why only one of them actually works.

By Akshay Sarode · March 12, 2026 · 12 min read · llm · ai-agents · observability · cost

TL;DR. A real budget cap on an AI agent has to do three things: fire before the next provider call (not 30 minutes after), scope to the agent that ran away (not the whole org), and survive a stuck loop that's already streaming a long completion. There are four architectural options for where the cap lives — application-level (max_tokens, turn limits), provider-level (OpenAI/Anthropic post-hoc spend caps), gateway/proxy-level, and SDK-middleware-level. The RelayPlane case study — a single agent going from $0.80 per run to $47 per run overnight — kills three of the four. Application caps trip too late. Provider caps fire org-wide and after the spend. Gateways add latency and only see traffic that goes through them. SDK middleware (in-process, synchronous, agent-scoped) is the only architecture that fires fast enough to stop a $47 run before it becomes a $1,400 day. This is the long version with diagrams, real prices, and the trade-offs you should actually weigh. The Claude Code budget posture — see MindStudio's writeup — is the cleanest reference architecture I've seen.

The case study: $0.80 → $47 overnight

RelayPlane's runaway-cost analysis is the cleanest published example of an agent stuck loop. A coding agent with a vague stop condition entered a tool-call loop. The same input that yesterday cost $0.80 today cost $47. No code change. The model just kept reasoning. By the time the team noticed, three days had passed and they had burned $14,000.

SupraWall's runaway guide catalogues the variants — recursion through retrieval, tool-call ping-pong, "let me just check that one more time" loops, hallucinated tool calls that get re-attempted indefinitely. The DEV post on stopping cost blowups has detection patterns. The DEV "flat-fee era is over" post is about the broader economics — and connects to the Anthropic OpenClaw cutoff that turned subscription users into per-token customers in April 2026.

The shape of the failure is consistent: the agent makes more calls than expected, those calls are longer than expected, and the per-call price the agent uses is higher than expected (because it fell back to a more expensive model after a failure). The team's existing telemetry — usually a daily cost rollup — shows the spike one day late.

A daily rollup cannot save you. The cap has to fire synchronously, in front of the next provider call.

The four architectural options

Where can a budget cap live? In one of four places, plus combinations.

Option 1 — Application-level: max_tokens and turn limits

This is what every agent framework ships with. You set max_tokens=4096 on each call, you set max_iterations=10 on the agent loop, and you pray.

What it gets right: Free. No infrastructure. Fires per-call instantly.

What it gets wrong: Doesn't bound the cumulative cost of a run. Ten iterations × 4096 output tokens × $0.06/1k ≈ $2.46 worst case. Twenty iterations on a more expensive model ≈ $20. The agent can hit max_iterations and exit gracefully — except the framework you're using might silently restart the agent on certain errors, or the developer wrote a wrapper that retries on timeout. Per-call caps don't compose into a per-run cap, and they definitely don't compose into a per-day or per-tenant cap.

Verdict: Required, but not sufficient. Set them anyway, but they're a per-call safety net, not a budget.
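The worst-case arithmetic above is easy to sketch: per-call caps bound each call, but cumulative cost grows linearly with iterations, so a retry wrapper that multiplies the iteration count multiplies the bill with it (prices here are illustrative):

```python
# Worst-case run cost under application-level caps only (illustrative prices).
MAX_TOKENS = 4096      # per-call output cap
PRICE_PER_1K = 0.06    # $/1k output tokens, illustrative

def worst_case_run_cost(max_iterations: int) -> float:
    """Per-call caps bound each call, not the run: cost scales with iterations."""
    per_call = MAX_TOKENS * PRICE_PER_1K / 1000
    return max_iterations * per_call

print(round(worst_case_run_cost(10), 2))   # 2.46 — bounded, but only per run
print(round(worst_case_run_cost(100), 2))  # 24.58 — a retry wrapper 10x's the bound
```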

Option 2 — Provider-level: OpenAI / Anthropic spend caps

Both OpenAI and Anthropic let you set a monthly spend cap on your organisation. Hit it, the API returns an error.

What it gets right: Free. You don't have to build it. Last-resort defence.

What it gets wrong: Three things, all of them production-killing.

(a) It fires org-wide. If your billing-research agent runs away and trips the cap, your customer-facing chat agent goes dark too. Every tenant, every product surface. That's not containment. That's tripping the main breaker for the whole house.

(b) It fires after the spend. Provider caps are computed at invoice-reconciliation time, with up to several hours of lag. You can blow through the cap in real time and the API will keep accepting requests for a while. By the time the cap actually trips, you've spent past it.

(c) It only knows about your direct API spend. If you use OpenRouter, Bedrock, or any gateway, the provider cap doesn't see those calls.

Verdict: Set it as a last-resort backstop. Do not rely on it as a primary cap. The DEV "flat fee era is over" post covers why provider-level caps are systematically the wrong layer.

Option 3 — Gateway / proxy-level

Run a proxy in front of the provider. The proxy sees every request, attributes it to a tenant/agent, computes running cost, and rejects requests that would exceed the budget.

What it gets right: Centralised control. Visibility across providers (you can route through the proxy regardless of which provider you call). Synchronous — the cap fires before the request leaves the proxy.

What it gets wrong: Latency hop, usually 10–30ms p50, sometimes worse. If your proxy goes down, your agents go down (you need a circuit breaker that fails open or closed depending on policy). If your team uses any SDK that bypasses the proxy — or if a developer "just tests in prod" with their own API key — the cap is invisible to those calls. And the proxy can't see in-process state (variable values, decision branches in your agent code) the way an SDK can.

Verdict: Works if you can guarantee 100% of traffic goes through. In practice, that's hard for teams with many engineers and multiple SDK languages.
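The proxy-side decision itself is simple; the hard part is the routing guarantee. A minimal sketch of the admission gate, assuming the proxy can estimate a request's cost before forwarding (all names are illustrative):

```python
# Minimal proxy-side budget gate (illustrative sketch — a real proxy would
# also attribute each request to a tenant/agent and persist running totals).
class BudgetExceeded(Exception):
    pass

class ProxyBudget:
    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def admit(self, estimated_cost_usd: float) -> None:
        """Reject synchronously, before the request leaves the proxy."""
        projected = self.spent_usd + estimated_cost_usd
        if projected > self.limit_usd:
            raise BudgetExceeded(
                f"would spend ${projected:.2f} of ${self.limit_usd:.2f}"
            )
        self.spent_usd = projected

gate = ProxyBudget(limit_usd=1.00)
gate.admit(0.40)   # forwarded
gate.admit(0.40)   # forwarded
try:
    gate.admit(0.40)  # rejected: would cross $1.00
except BudgetExceeded as e:
    print("blocked:", e)
```

The gate only protects traffic it sees — which is exactly the bypass problem the verdict describes.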

Option 4 — SDK middleware: in-process, synchronous, agent-scoped

The cap lives inside your application process. Every provider client (OpenAI, Anthropic, Bedrock, OpenRouter) is wrapped at SDK level. Before each request, the middleware checks the running cost against the budget. Reject if over.

What it gets right: Synchronous. Sub-millisecond overhead. Agent-scoped (the cap can know "this is the customer-chat agent" vs "this is the billing-research agent" because the SDK is in your code, not a network hop away). Survives misconfigured developer environments because the wrapping happens at SDK initialisation, not via traffic routing. Sees in-process state — a budget can be a function of the user's plan, the tenant's quota, the agent's role, not just a flat dollar limit.

What it gets wrong: Per-language. You need an SDK in every language your team uses (Python, TS, Go are the common ones). The wrapping has to happen at the right layer — wrap too low (HTTP client) and you lose semantic context; wrap too high (LangChain agent) and developers can bypass it by dropping to the raw client.

Verdict: This is the architecture that actually works. It's also the most operationally invasive — you need to ship the SDK and convince your team to use it.
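In Python, SDK-level wrapping reduces to putting a budget check in front of the client's request method. This sketch is illustrative (the wrapper names and cost accounting are assumptions, not any particular vendor's API):

```python
import functools

class RunBudget:
    """In-process, per-run budget. The check fires before each provider call."""
    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def check(self) -> None:
        if self.spent_usd >= self.limit_usd:
            raise RuntimeError(f"run budget ${self.limit_usd:.2f} exhausted")

    def record(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd

def wrap_create(create_fn, budget: RunBudget, cost_fn):
    """Wrap a provider SDK's create() so the budget gate runs in-process,
    synchronously, before the request is sent."""
    @functools.wraps(create_fn)
    def guarded(*args, **kwargs):
        budget.check()                    # sub-millisecond, no network hop
        response = create_fn(*args, **kwargs)
        budget.record(cost_fn(response))  # update running total from usage
        return response
    return guarded
```

Because the wrapper runs in your process, cost_fn can consult anything in scope — the tenant's plan, the agent's role — not just a flat dollar limit.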

Combination patterns: defence in depth

The right answer is not "pick one." The right answer is:

  1. Application-level caps as per-call safety net: max_tokens, max_iterations. Always on.
  2. SDK middleware as the primary budget interlock — synchronous, agent-scoped, sub-millisecond.
  3. Provider-level cap as last-resort backstop — only fires if the SDK middleware was bypassed.
  4. Gateway/proxy where centralised routing is needed — for teams that already run a gateway, layer the SDK on top of it.

What Claude Code does (the cleanest reference)

Claude Code's budget posture is the cleanest reference architecture I've seen ship in a real product. The structure:

  1. Per-task token budget, declared up-front.
  2. Streaming token counter, updated as tokens come back.
  3. Synchronous decision at each tool-call boundary: continue, request user confirmation, or stop.
  4. User-visible spend display so the human-in-the-loop can stop a run before the cap.

The key insight: the cap is not just "stop" — it can also be "ask for confirmation to continue." This is how Claude Code stays useful for long tasks while bounding cost. A flat hard-stop cap is brittle. A confirmation cap is collaborative.

For non-interactive agents (overnight batch, server-side), the confirmation cap collapses to a hard stop. For interactive agents, the confirmation cap is the right primitive.
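The confirmation-cap primitive reduces to a three-way decision at each tool-call boundary. A sketch, with the soft and hard thresholds as assumptions:

```python
def budget_decision(spent_usd: float, soft_usd: float, hard_usd: float,
                    interactive: bool) -> str:
    """Three-way decision at a tool-call boundary: continue, confirm, or stop.
    Non-interactive agents collapse the confirmation band into a hard stop."""
    if spent_usd >= hard_usd:
        return "stop"
    if spent_usd >= soft_usd:
        return "confirm" if interactive else "stop"
    return "continue"

assert budget_decision(0.50, soft_usd=2.0, hard_usd=5.0, interactive=True) == "continue"
assert budget_decision(3.00, soft_usd=2.0, hard_usd=5.0, interactive=True) == "confirm"
assert budget_decision(3.00, soft_usd=2.0, hard_usd=5.0, interactive=False) == "stop"
```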

What Sutrace does

We sit at the SDK-middleware layer. Wrap your provider client (one line for openai, anthropic, boto3 Bedrock, the OpenRouter SDK) and a budget interlock fires synchronously before each request.

The budget is declared in three dimensions:

  1. Per-run — the most common. "This agent run is allowed to spend up to $X."
  2. Per-day per-tenant — "Customer Acme is allowed $Y/day across all agents."
  3. Per-agent per-month — "The billing-research agent is allowed $Z/month across all customers."

When a budget is crossed, you choose the action: hard stop, confirmation prompt (for interactive agents), or downgrade to a cheaper model. The decision happens in <1ms, in-process, before the next provider call leaves your network.
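As a hypothetical configuration sketch — illustrative shape only, not the real SDK surface — the three scopes and their over-budget actions might be declared and checked like this:

```python
# Hypothetical budget declarations: scope, limit, and the action on crossing.
budgets = [
    {"scope": "run",         "limit_usd": 5.0,    "action": "hard_stop"},
    {"scope": "tenant_day",  "limit_usd": 200.0,  "action": "confirm"},
    {"scope": "agent_month", "limit_usd": 1500.0, "action": "downgrade"},
]

def first_crossed(budgets, spent):
    """Return the first budget whose running total has crossed its limit,
    or None if every scope is still under budget."""
    for b in budgets:
        if spent.get(b["scope"], 0.0) >= b["limit_usd"]:
            return b
    return None

hit = first_crossed(budgets, {"run": 5.2, "tenant_day": 40.0})
print(hit["action"] if hit else "continue")  # hard_stop
```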

We also offer an opt-in proxy mode for teams who can't or won't ship the SDK. The proxy mode adds the network hop but enforces the same caps. Most teams use the SDK; the proxy is for the long tail.

The cap is exposed as an OpenTelemetry GenAI semantic-convention attribute on every span — sutrace.budget.remaining, sutrace.budget.scope, sutrace.budget.action. Your dashboards see the running budget alongside latency and token usage. See the use case for the full picture and the multi-provider routing post for how it ties into provider attribution.

The honest trade-offs

Three things the SDK-middleware architecture can't do, which you should know about:

1. Cancel an in-flight long completion. Provider APIs do not expose cancellation for streaming completions. If the budget is $50 and a 30-second streaming completion is already in flight when the budget is crossed, the running total might land at $50.30 by the time it completes. We block the next call, not the in-flight one. This is the architectural floor; nobody can do better. The mitigation: keep max_tokens low (per-call cap, fires inside the model) and let the SDK middleware handle multi-call cumulative caps.

2. See traffic that bypasses the SDK. A developer who imports openai directly without the wrapper, or uses curl to test, is invisible to the cap. Mitigation: wrap your SDK initialisation at module level (the wrapping happens at import) and run a CI check that fails builds importing the un-wrapped client.
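That CI check can be as simple as an AST scan over each module that fails the build when a raw provider client is imported directly. A sketch — the forbidden-module list is illustrative:

```python
import ast

FORBIDDEN = {"openai", "anthropic"}  # raw clients; illustrative list

def direct_imports(source: str) -> set:
    """Return forbidden modules imported directly in the given source."""
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found |= {a.name.split(".")[0] for a in node.names} & FORBIDDEN
        elif isinstance(node, ast.ImportFrom) and node.module:
            root = node.module.split(".")[0]
            if root in FORBIDDEN:
                found.add(root)
    return found

assert direct_imports("import openai\n") == {"openai"}
assert direct_imports("from myapp.llm import client\n") == set()
```

Run it over every .py file in CI and exit non-zero when the set is non-empty; curl-in-a-shell remains invisible, which is why the provider-level backstop stays on.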

3. Enforce a cap on a third-party agent service. If your code calls an external agent (e.g., an OpenAI Assistants API agent that runs entirely on OpenAI's side), the SDK only sees the API call, not the loop the assistant ran internally. Mitigation: scope budgets to the assistant invocation, not the inner steps.

These are real limits. Anyone who tells you their budget cap doesn't have these limits is misrepresenting their architecture.

How does this fit with the rest of the observability stack?

The cap is one of three things Sutrace does that nobody else in the LLM-observability category ships:

  1. Hard budget caps (this post)
  2. On-host PII redaction
  3. Prompt-injection signals (the EchoLeak / CamoLeak post)

LangSmith, Helicone, Langfuse, Phoenix all observe spend. None of them stop spend. That's the gap. If you're shipping agents in 2026 and your only defence against runaway is "I'll see it in the dashboard tomorrow," you should fix that this quarter.

Tools and references

For the broader picture see the LangSmith, Helicone, and Langfuse breakdowns — and the 4-way honest comparison.