Multi-provider LLM routing — which provider actually served that trace?
OpenRouter, AWS Bedrock, and the gateway pattern made multi-provider routing the default. Without span-level provider attribution, your eval baseline is a coin flip. The OTel GenAI semantic conventions are the answer.
TL;DR. Most teams who ship LLM agents in 2026 are routing through a gateway — OpenRouter (400+ models, 60+ providers), AWS Bedrock (multi-vendor), or a custom proxy following the AWS Multi-Provider GenAI Gateway reference architecture. The architecture is sensible — failover, cost optimisation, vendor independence. The observability problem: when your eval baseline regresses, you need to answer "did the model change, did the prompt change, or did the upstream provider get swapped under us?" Without span-level provider attribution, you're guessing. The fix is the OpenTelemetry GenAI semantic conventions: every span carries gen_ai.system, gen_ai.request.model, gen_ai.response.model, and (for gateway-routed traffic) the upstream provider that actually served the call. This post walks through the patterns, the OTel attributes, the gateway choices, and what Sutrace tags by default. The Agenta top LLM gateways roundup and TrueFoundry's multi-model routing post are the best background reads.
Why teams are routing through gateways
Three reasons, all of them economic.
1. Failover. When OpenAI has a P0 incident — every quarter or two, on average — your customer-facing agent goes dark. Routing through a gateway lets you fail over to Anthropic or Bedrock. The cost is one weekend of integration work; the upside is not getting paged at 3am the next time GPT-5 returns 503s for an hour.
2. Cost optimisation. Different providers price the same capability differently. Claude 4 Opus is cheaper for some workloads, GPT-5 cheaper for others, Bedrock-hosted Llama-4-405B cheaper still. A gateway with routing rules can pick the cheapest upstream that meets latency and quality thresholds.
3. Capacity. Provider rate limits hit harder than they should. Routing through multiple providers gives you elastic capacity across vendor accounts.
The most popular gateway choices, in rough order of adoption:
- OpenRouter — 400+ models from 60+ providers, BYO API key model, single OpenAI-compatible endpoint. The hobbyist favourite that's quietly become a production default.
- AWS Bedrock — multi-vendor (Anthropic, Meta, Cohere, Amazon's own Nova, Mistral) inside a single AWS account. The enterprise pick.
- Custom proxy — typically built on LiteLLM (from BerriAI), deployed in-cluster, with org-specific routing rules.
- Vendor-specific gateways — Azure OpenAI's deployment routing, Google Vertex AI's Model Garden routing.
TrueFoundry's multi-model routing post is the cleanest external read on the patterns. Agenta's gateway roundup covers the vendor space.
The observability problem this creates
Your code says model="gpt-4o". The gateway sees the request, applies its routing rules, and sends it upstream to one of: OpenAI direct, Azure OpenAI, a Together-hosted variant, or a fall-back to Anthropic Claude with a translation layer. The response comes back. Your code logs "called gpt-4o, got response, here's the latency."
Three weeks later, your eval suite shows a regression. Accuracy on a held-out test set dropped 3%. You ask: did the prompt change (no, git says clean), did the model change (no, your code still says model="gpt-4o"), or did the upstream change? You don't know. Your traces don't say.
The gateway might have recorded which upstream it routed to — but that record lives in the gateway's logs, not in your spans, and joining them post-hoc is hard. With OpenRouter it's worse: upstream attribution exists only in per-request response headers, so if you didn't capture it at call time, it's gone.
This is the gap.
The OTel GenAI semantic conventions
The OpenTelemetry GenAI semantic conventions are the answer. They define a standard set of attributes that every LLM-related span should carry. The relevant ones for routing:
| Attribute | Meaning |
|---|---|
| `gen_ai.system` | The provider that served the request (`openai`, `anthropic`, `aws.bedrock`, `azure.openai`, `openrouter`) |
| `gen_ai.request.model` | The model your code asked for |
| `gen_ai.response.model` | The model the provider actually used (these diverge more than you'd expect) |
| `gen_ai.usage.input_tokens` | Input tokens consumed |
| `gen_ai.usage.output_tokens` | Output tokens generated |
| `gen_ai.operation.name` | `chat`, `embedding`, `completion`, etc. |
Sutrace also emits two attributes that aren't in the upstream spec yet but follow the same semantic-convention shape:
| Attribute | Meaning |
|---|---|
| `gen_ai.route.gateway` | The gateway your code talked to (`openrouter`, `bedrock`, `litellm-proxy`) |
| `gen_ai.route.upstream` | The actual upstream provider behind the gateway, when known |
For a request that goes app → OpenRouter → Anthropic Claude, the span carries:

```
gen_ai.system = "openrouter"
gen_ai.request.model = "anthropic/claude-4-opus"
gen_ai.response.model = "claude-4-opus-20260201"
gen_ai.route.gateway = "openrouter"
gen_ai.route.upstream = "anthropic"
```
When the eval regresses, you can query "show me all traces where gen_ai.route.upstream changed in the last 30 days." If the answer is "OpenRouter started routing 40% of traffic to a Together-hosted variant of Claude," you have your answer in one query.
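Over a local export of spans, that query is just a group-by. A stdlib-only sketch (the span-dict shape and function name are assumptions; your backend's actual query language will differ):

```python
from collections import defaultdict

def upstreams_that_changed(spans):
    """Given span dicts carrying gen_ai attributes, return the request
    models that were served by more than one upstream provider."""
    seen = defaultdict(set)
    for span in spans:
        seen[span["gen_ai.request.model"]].add(span["gen_ai.route.upstream"])
    return {model: sorted(ups) for model, ups in seen.items() if len(ups) > 1}
```

Any model that appears in the result had its upstream mix change within the window you exported — the "OpenRouter started routing to Together" case surfaces immediately.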
How to instrument
Path 1: OTel auto-instrumentation
If you use OpenAI, Anthropic, or Bedrock SDKs in Python or TS, the OpenInference auto-instrumentation packages emit the standard gen_ai.* attributes for free. Install openinference-instrumentation-openai (or the equivalent), point your OTel collector at Sutrace, and the spans appear with the standard attributes.
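A minimal setup sketch for the Python path, assuming the OpenInference OpenAI instrumentor and an OTLP/HTTP collector; the endpoint URL is a placeholder for your own collector config:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.openai import OpenAIInstrumentor

# Ship spans to your collector (swap in your Sutrace / OTel endpoint).
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://collector.example.com/v1/traces"))
)
trace.set_tracer_provider(provider)

# After this, every OpenAI SDK call emits a span with gen_ai.* attributes.
OpenAIInstrumentor().instrument()
```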
Path 2: Manual instrumentation
If you're building your own client wrapper, set the attributes yourself when you create the span. The shape is straightforward — a span around the LLM call, attributes set from the request and response.
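A stdlib-only sketch of that mapping — the response-dict shape and the helper name are assumptions; set the returned attributes on the span you open around the call:

```python
def genai_span_attributes(requested_model, response, gateway=None, upstream=None):
    """Build the gen_ai.* attribute dict for a span around one LLM call."""
    attrs = {
        "gen_ai.operation.name": "chat",
        "gen_ai.system": gateway or upstream or "unknown",
        "gen_ai.request.model": requested_model,
        # The model the provider actually used, from the response body.
        "gen_ai.response.model": response["model"],
        "gen_ai.usage.input_tokens": response["usage"]["input_tokens"],
        "gen_ai.usage.output_tokens": response["usage"]["output_tokens"],
    }
    if gateway:
        attrs["gen_ai.route.gateway"] = gateway
    if upstream:
        attrs["gen_ai.route.upstream"] = upstream
    return attrs
```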
Path 3: Gateway-side instrumentation
Some gateways emit OTel spans directly. LiteLLM has experimental OTel support. OpenRouter does not (as of April 2026). For OpenRouter, you have to instrument client-side and infer the upstream from response headers (which OpenRouter does include — check openrouter-served-by).
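A sketch of that client-side inference — the header name follows the openrouter-served-by example above and should be verified against current OpenRouter docs:

```python
def upstream_from_openrouter_headers(headers):
    """Infer gen_ai.route.upstream from OpenRouter response headers.
    Returns None when no attribution header is present."""
    served_by = headers.get("openrouter-served-by")
    return served_by.strip().lower() if served_by else None
```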
For Bedrock, the Boto3 client emits enough information that the upstream provider is straightforward to derive from the model ID.
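A sketch of that derivation; the region-prefix handling for cross-region inference profile IDs (e.g. a `us.` prefix) is an assumption worth checking against the model IDs in your own account:

```python
def provider_from_bedrock_model_id(model_id):
    """Derive the upstream vendor from a Bedrock model ID, e.g.
    'anthropic.claude-4-sonnet-20260201-v1:0' -> 'anthropic'.
    Cross-region inference profile IDs may carry a short region
    prefix ('us.', 'eu.', 'apac.'), which we strip first."""
    parts = model_id.split(".")
    if len(parts) > 2 and parts[0] in {"us", "eu", "apac"}:
        parts = parts[1:]
    return parts[0]
```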
What Sutrace does by default
Three things, automatically.
1. Tag every span with gen_ai.system and route attributes. If you use the Sutrace SDK middleware, this happens at SDK initialisation. If you use OTel auto-instrumentation, our collector enriches incoming spans with the route attributes when it can derive them from response metadata.
2. Build the routing topology view. A dashboard that shows the gateway-and-upstream graph for your traffic over the last N hours. You can see at a glance "70% of GPT-4o requests went to OpenAI direct, 30% to Azure OpenAI." Drift here is a leading indicator.
3. Eval regression cross-tab. When your eval suite runs against a baseline, the regression report cross-tabs by gen_ai.route.upstream. If accuracy regressed and the upstream mix changed in the same window, that's flagged.
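The cross-tab itself is a small computation; a stdlib-only sketch over eval results (the result-dict shape is an assumption, not Sutrace's actual schema):

```python
from collections import defaultdict

def accuracy_by_upstream(results):
    """Cross-tab eval accuracy by gen_ai.route.upstream.
    results: dicts with an 'upstream' string and a 'correct' bool."""
    tally = defaultdict(lambda: [0, 0])
    for r in results:
        tally[r["upstream"]][0] += int(r["correct"])
        tally[r["upstream"]][1] += 1
    return {u: correct / total for u, (correct, total) in tally.items()}
```

A per-upstream accuracy gap in the same window as an upstream-mix shift is exactly the signal the regression report flags.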
A worked example
Suppose you ship a customer-support agent using OpenRouter, with anthropic/claude-4-sonnet as the model. Eval baseline runs nightly.
Day 1–14. Eval at 0.94 accuracy. gen_ai.route.upstream is 100% anthropic.
Day 15. Anthropic has a regional incident. OpenRouter's routing rules fail over 60% of traffic to a Together-hosted Claude variant for 4 hours. Eval that night runs at 0.88 — a 6-point drop.
Without route attribution: you spend a day suspecting the prompt, the retrieval, the model, the eval set. Six engineers, eight hours.
With route attribution: the eval report cross-tab shows the regression cluster matches the time window where gen_ai.route.upstream = "together". The Sutrace SDK emits this without you having to do anything. You confirm in 5 minutes, lock OpenRouter to Anthropic-only via routing rules, and move on.
Pitfalls to avoid
1. Don't trust the gateway's billing dashboard for accuracy. Gateways typically attribute cost to the upstream they routed to, but the model name in their dashboard is the one your code asked for, not the one the upstream actually served. The two diverge for compatibility-shimmed routes.
2. Don't assume model="gpt-4o" means GPT-4o. Even at OpenAI direct, gpt-4o is a versioned alias that points to whatever the current production weights are. The actual model version comes back in gen_ai.response.model. Track both.
3. Don't forget Bedrock. Bedrock model IDs include both vendor and capability — anthropic.claude-4-sonnet-20260201-v1:0 is unambiguous. But teams often record only the friendly name. Tag both.
4. Don't assume rate limits are per-provider. When you route through a gateway, rate limits apply at the gateway level too. Hitting OpenRouter's per-account rate limit is just as production-impacting as hitting OpenAI's.
5. Don't lose visibility through Bedrock cross-region inference. Bedrock's "cross-region inference" feature transparently routes between regional model endpoints for capacity. Your span needs to record which region actually served the call — aws.region as a span attribute. We've seen teams chase a latency regression for a week before discovering Bedrock had silently moved 30% of their traffic from us-east-1 to us-west-2 and the additional propagation delay was the cause.
6. Don't ignore the routing-rule audit trail. Most gateways let you change routing rules via dashboard or API. Those changes need to be tracked. Without an audit trail, "the eval started regressing on Tuesday" is a mystery; with one, "the eval regressed at 14:00 Tuesday because the routing rule changed at 13:55" is the answer.
A short tour of the major gateways' span quality
If you're picking a gateway today and observability matters:
- AWS Bedrock. Native CloudWatch + X-Ray support. The Boto3 client emits enough information to derive the upstream cleanly. Cross-region inference is the gotcha (see above). Bedrock model IDs are the most semantically clean of any vendor.
- OpenRouter. Response headers (`openrouter-served-by`, `openrouter-model`) are the canonical attribution source. There's no native OTel emission. The OpenInference instrumentation reads the headers and tags the spans correctly.
- LiteLLM proxy. Experimental OTel emission. Configurable. The most flexible if you self-host.
- Azure OpenAI deployment routing. Each deployment is a logical alias to a specific model version in a specific region. The deployment name is an unambiguous record. Treat the deployment name as part of `gen_ai.request.model`.
- Custom proxies. You write the OTel emission. Make sure to set both `gen_ai.route.gateway` and `gen_ai.route.upstream`.
Tools and references
- OTel GenAI semantic conventions — the spec
- AWS Multi-Provider GenAI Gateway reference architecture — the enterprise blueprint
- TrueFoundry multi-model routing
- Agenta top LLM gateways
- OpenInference instrumentation packages — for SDK auto-instrumentation
How this fits with the rest of the stack
Multi-provider routing visibility is one of three things Sutrace ships that the rest of the LLM-observability category doesn't by default — alongside hard budget caps and prompt-injection signals. For the full picture see the AI agent observability use case, the LangSmith comparison, or the Helicone comparison for the proxy-vs-SDK trade-offs.
If you're already routing through a gateway and your traces don't tell you which upstream served each call — fix that this sprint. The eval-regression mystery you'll save yourself from is worth the integration work alone.