
Engineering · Jun 12, 2026

How "auto" routing cuts inference spend by 31% without quality loss

Mara Vance

Routing Lead · 9 min read

When teams wire up an LLM, they usually pin a single model and move on. It works — until the bill arrives, or the provider has a bad afternoon. The premise behind auto-routing is simple: most requests don't need your most expensive model, and the cheapest capable model changes by the hour.

Scoring a request

Every incoming request is scored along four axes before it's dispatched: estimated cost, expected latency, required context window, and live provider health. The router then selects the cheapest model that clears your configured quality bar.

score = w_cost · cost
      + w_lat  · latency
      + w_cap  · capability_gap
      + w_health · provider_health
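
To make the formula concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the production router: the `Candidate` fields, the default weights, the 0 to 1 normalization, and the convention that a lower score is better (with provider health expressed as a penalty) are all hypothetical. It filters candidates by the quality bar first, then picks the lowest weighted score.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    """One routable model. All fields are illustrative assumptions."""
    name: str              # hypothetical model label
    cost: float            # normalized estimated cost, 0..1
    latency: float         # normalized expected latency, 0..1
    capability_gap: float  # 0 means the model fully covers the request
    health_penalty: float  # 0 means healthy provider, 1 means unavailable
    quality: float         # offline quality estimate, 0..1


@dataclass(frozen=True)
class Weights:
    """Scoring weights; the defaults are made up for this example."""
    cost: float = 0.5
    latency: float = 0.2
    cap: float = 0.2
    health: float = 0.1


def score(c: Candidate, w: Weights) -> float:
    # Weighted sum of the four terms; lower is better in this sketch.
    return (
        w.cost * c.cost
        + w.latency * c.latency
        + w.cap * c.capability_gap
        + w.health * c.health_penalty
    )


def pick(candidates: Sequence[Candidate], w: Weights, quality_bar: float) -> Candidate:
    # Keep only models that clear the quality bar, then take the lowest score.
    eligible = [c for c in candidates if c.quality >= quality_bar]
    if not eligible:
        raise ValueError("no candidate clears the quality bar")
    return min(eligible, key=lambda c: score(c, w))
```

With a cost-heavy weighting like the one above, a small model wins whenever it clears the bar; raising `quality_bar` is what pushes a request to a heavier model.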

In production traffic across a trailing quarter, this cut blended inference spend by 31% with no measurable drop in task success — because the heavy models are reserved for the requests that actually need them.

What we learned

Fallback matters as much as the initial pick. A request that reroutes on a provider blip should feel identical to one that didn't: same streaming, no surfaced error. Making that reroute invisible is most of the engineering work.
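
One way to picture that fallback is the sketch below. It is a simplified illustration, not our actual implementation: `ProviderError` and the provider callables are hypothetical, and it only covers failures that happen before the first chunk arrives. Rerouting mid-stream is harder and is not attempted here.

```python
from typing import Callable, Iterable, Iterator, Sequence


class ProviderError(Exception):
    """Hypothetical error raised when a provider call fails."""


_DONE = object()  # sentinel marking an empty stream


def stream_with_fallback(
    providers: Sequence[Callable[[str], Iterable[str]]],
    prompt: str,
) -> Iterator[str]:
    """Yield chunks from the first provider that starts streaming.

    A provider that fails before its first chunk is skipped silently,
    so the caller sees a single uninterrupted stream.
    """
    last_error = None
    for call in providers:
        try:
            chunks = iter(call(prompt))
            first = next(chunks, _DONE)
        except ProviderError as err:
            last_error = err  # remember the failure, try the next provider
            continue
        if first is _DONE:
            return  # provider answered with an empty stream
        yield first
        yield from chunks
        return
    # Only surface an error once every provider has failed.
    raise ProviderError("all providers failed") from last_error
```

The caller iterates over `stream_with_fallback(...)` exactly as it would over a single provider's stream, which is the point: the reroute never shows up in the calling code.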

"The cheapest capable model is a moving target. Routing is just keeping up with it in real time."

We're rolling these scoring weights out as per-key policies so you can tune the cost/quality tradeoff yourself. More on that soon.

Try auto-routing on your traffic.

Set "model": "auto" and watch the spend drop.

