Deeprouter
Engineering · Jun 12, 2026

How "auto" routing cuts inference spend by 31% without quality loss

Mara Vance
Routing Lead · 9 min read

When teams wire up an LLM, they usually pin a single model and move on. It works — until the bill arrives, or the provider has a bad afternoon. The premise behind auto-routing is simple: most requests don't need your most expensive model, and the cheapest capable model changes by the hour.

Scoring a request

Every incoming request is scored along four axes before it's dispatched: estimated cost, expected latency, required context window, and live provider health. The router then selects the cheapest model that clears your configured quality bar.

score = w_cost · cost + w_lat · latency + w_cap · capability_gap + w_health · provider_health

In production traffic across a trailing quarter, this cut blended inference spend by 31% with no measurable drop in task success — because the heavy models are reserved for the requests that actually need them.
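To make the scoring step concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Deeprouter's implementation. The assumptions: every term is normalized to a 0..1 penalty so that a lower score is better, provider_health is expressed as a degradation penalty (0 means healthy), capability_gap stands in for the context-window requirement, and the quality bar is applied as a hard filter before scoring. The names Candidate, Weights, pick, and the weight values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One model the router could dispatch to (hypothetical shape)."""
    name: str
    cost: float             # normalized estimated cost, lower is better
    latency: float          # normalized expected latency, lower is better
    capability_gap: float   # 0.0 when the model fully covers the request
    provider_health: float  # ASSUMPTION: penalty form, 0.0 = healthy, 1.0 = degraded
    quality: float          # hypothetical quality estimate in 0..1


@dataclass(frozen=True)
class Weights:
    """Illustrative weights only; real values are not given in the post."""
    cost: float = 0.4
    latency: float = 0.2
    capability: float = 0.3
    health: float = 0.1


def score(c: Candidate, w: Weights) -> float:
    """Weighted sum of the four terms; ASSUMPTION: lower is better."""
    return (
        w.cost * c.cost
        + w.latency * c.latency
        + w.capability * c.capability_gap
        + w.health * c.provider_health
    )


def pick(candidates: Sequence[Candidate], w: Weights,
         quality_bar: float) -> Optional[Candidate]:
    """Drop models below the quality bar, then return the lowest-scoring one."""
    eligible = [c for c in candidates if c.quality >= quality_bar]
    if not eligible:
        return None
    return min(eligible, key=lambda c: score(c, w))


CANDIDATES = [
    Candidate("small", cost=0.1, latency=0.2, capability_gap=0.0,
              provider_health=0.0, quality=0.70),
    Candidate("large", cost=0.9, latency=0.5, capability_gap=0.0,
              provider_health=0.0, quality=0.95),
    Candidate("flaky", cost=0.05, latency=0.2, capability_gap=0.0,
              provider_health=1.0, quality=0.70),
]

if __name__ == "__main__":
    for bar in (0.6, 0.9):
        chosen = pick(CANDIDATES, Weights(), quality_bar=bar)
        print(bar, chosen.name if chosen else None)
```

With these made-up numbers, a low quality bar lets the small model win even though the flaky one is nominally cheaper, because the health penalty outweighs the price difference. Raising the bar leaves only the large model eligible.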

What we learned

Fallback matters as much as the initial pick. A request that reroutes on a provider blip should feel identical to one that didn't: same streaming, no surfaced error. Making that reroute invisible is most of the engineering.
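One way to picture the fallback behavior is a generator that walks a ranked list of providers and only surfaces an error when every option is exhausted. This is a simplified sketch, not the production router. ProviderError and stream_with_fallback are hypothetical names, each provider is assumed to be a callable that returns an iterable of text chunks, and the sketch only reroutes silently when a provider fails before emitting its first chunk. Recovering from a failure partway through a stream is a harder problem that this example does not attempt.

```python
from typing import Callable, Iterable, Iterator, Optional, Sequence


class ProviderError(Exception):
    """Hypothetical error raised when a provider call fails."""


Provider = Callable[[str], Iterable[str]]


def stream_with_fallback(providers: Sequence[Provider],
                         prompt: str) -> Iterator[str]:
    """Yield chunks from the first provider that responds.

    Providers are tried in the order given (assumed already ranked).
    A failure before the first chunk is swallowed and the next provider
    is tried, so the caller sees one uninterrupted stream.
    """
    last_error: Optional[ProviderError] = None
    for call in providers:
        emitted = False
        try:
            for chunk in call(prompt):
                emitted = True
                yield chunk
            return  # stream finished cleanly
        except ProviderError as err:
            if emitted:
                # ASSUMPTION: mid-stream failures are out of scope here,
                # so they propagate instead of being rerouted.
                raise
            last_error = err
    raise RuntimeError("all providers failed") from last_error
```

The caller iterates over the result the same way whether or not a reroute happened, which is the property the paragraph above describes.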

"The cheapest capable model is a moving target. Routing is just keeping up with it in real time."

We're rolling these scoring weights out as per-key policies so you can tune the cost/quality tradeoff yourself. More on that soon.

Try auto-routing on your traffic.

Set "model": "auto" and watch the spend drop.
