Most AI teams pay 5–30× too much because they send every prompt to the flagship model, so summarising an email is billed at the same per-token rate as analysing a contract. Model routing is the decision layer that assigns each task to the cheapest model that still meets the quality bar. The six rules below come from production rollouts, with numbers, code and a concrete stack.

Who this is for

CTOs, heads of AI, ML engineers and finance leads who see a number on the OpenAI / Anthropic / Google invoice that's bigger than they expected and want to cut it without degrading the product.

Why LLM model routing is the #1 FinOps lever right now

Three numbers that frame the scale of the problem in 2026:

  • 30× — the input-token price gap between a flagship model (Claude Opus 4.7, GPT-flagship) and the “mini / haiku / flash” class from the same provider.
  • 70–85% — the share of calls in a typical B2B SaaS running an LLM under the hood that don't require the flagship model (classification, extraction, light rewrites, summarisation under 4k tokens).
  • 60–80% — typical bill reduction after rolling out a simple two-tier router with no measurable quality regression.

These are median results from the floomi audits we run for clients. The full methodology is laid out in our AI cost audit piece — this one focuses on the routing layer alone.

The problem is structural: engineering teams pick a model once — at MVP — and stay on it because “it works”. It does, but 80% of load is tasks where a model two tiers down would produce the same output for a fraction of the price. That's not a “later” optimisation — it's money left on the table that grows linearly with traffic.

Rule 1 — Classify tasks by required “IQ”, not by feature name

There is no such thing as “the model for our assistant”. The assistant handles 10–40 different task types, each with a different requirement profile.

A practical framework — three complexity classes:

| Class | Characteristics | Examples | Right model tier |
|---|---|---|---|
| L1: Mechanical | No reasoning, format transformation, classification over a finite set | Tagging, invoice field extraction, sentiment, light rewrite | Haiku / Mini / Flash |
| L2: Compositional | 2–4 step reasoning, aggregation, inference from given data | Long-doc summarisation, RAG Q&A, short-code refactor | Sonnet / GPT-4.x mid |
| L3: Expert | Multi-step reasoning, creativity, long context with subtle logic | Legal analysis, complex-code debug, agent with 10+ tools | Opus / GPT-flagship |

Implementation step: start with a list of the 20 most frequent call-paths in your system, assign a class to each, then compute current cost vs. post-routing cost. From our audits: ~70% of items turn out to be L1, ~20% L2, only 5–10% are real L3 — and you're paying flagship for 100%.
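The before/after comparison from this step can be sketched in a few lines. This is a minimal illustration, not audit tooling: the call-path names, monthly token volumes and per-tier prices are invented placeholders, and it assumes input tokens dominate the bill.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M input tokens; replace with current provider rates.
PRICE_PER_M = {"L1": 0.25, "L2": 3.00, "L3": 15.00}
FLAGSHIP_PRICE = PRICE_PER_M["L3"]


@dataclass
class CallPath:
    name: str
    complexity: str         # "L1", "L2" or "L3"
    tokens_per_month: int   # input tokens sent through this call-path


def monthly_cost(paths, routed):
    """Total monthly cost; routed=False prices every call-path at the flagship rate."""
    total = 0.0
    for p in paths:
        price = PRICE_PER_M[p.complexity] if routed else FLAGSHIP_PRICE
        total += p.tokens_per_month / 1_000_000 * price
    return total


# Illustrative inventory (invented numbers, roughly a 70/20/10 split).
paths = [
    CallPath("ticket_tagging", "L1", 700_000_000),
    CallPath("rag_answer", "L2", 200_000_000),
    CallPath("contract_review", "L3", 100_000_000),
]

before = monthly_cost(paths, routed=False)
after = monthly_cost(paths, routed=True)
print(f"before: ${before:,.0f}  after: ${after:,.0f}  saving: {1 - after / before:.0%}")
```

With real data, the 20 rows come from your request logs and the prices from the current rate cards; the structure of the calculation stays the same.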

Rule 2 — Cheap-first cascade with escalation on failure

The simplest router that works is a cascade: cheapest model first, escalate only when the response fails validation.

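# Pseudocode: cheap-first cascade.
# llm_call() and validate() are placeholders you supply.
class EscalationFailed(Exception):
    """Raised when no model in the cascade passes validation."""

def route(prompt, schema, models=("haiku", "sonnet", "opus")):
    for model in models:                    # cheapest first
        response = llm_call(model, prompt)
        if validate(response, schema):      # external, deterministic check
            return response, model
    raise EscalationFailed(f"no model passed validation: {models}")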

The key piece is deterministic validation — JSON schema, regex on structure, checking the class is in a finite set, length, hash/checksum. Without validation the cascade has no way to recognise failure.
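As an illustration of what such a validator might look like for a classification task, here is a minimal sketch using only the standard library. The schema shape (`field`, `allowed`, `max_chars`) and the label set are hypothetical, chosen for this example; a real deployment might use a full JSON Schema validator instead.

```python
import json

# Hypothetical schema shape for a classification task; not a JSON Schema document.
TICKET_SCHEMA = {
    "field": "label",
    "allowed": {"billing", "shipping", "returns", "other"},
    "max_chars": 200,
}


def validate(response: str, schema: dict) -> bool:
    """Deterministic checks only: length, parseable JSON, expected shape, value in a finite set."""
    if len(response) > schema["max_chars"]:
        return False
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    # Exactly one expected key, nothing else.
    if not isinstance(data, dict) or set(data) != {schema["field"]}:
        return False
    value = data[schema["field"]]
    return isinstance(value, str) and value in schema["allowed"]
```

Every check here gives the same verdict for the same input, so a failed response triggers escalation regardless of how confident the model sounded.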

Case study — e-commerce client, support-ticket classification
| Setup | Cost / month | Traffic split | Classification F1 |
|---|---|---|---|
| Before: 100% → GPT-4o | $4,200 | all traffic on flagship | 0.93 |
| After: Haiku → Sonnet → Opus | $510 | 84% Haiku · 13% Sonnet · 3% Opus | 0.93 |

Result: 88% savings, quality unchanged.

Caution

Don't escalate based on model self-assessment. A cheap model often confidently claims it gave a good answer when it's hallucinating. Validation has to be external and deterministic.

Rule 3 — Route on context length and type, not on model name

The second most important cost predictor is context length — and this is where most teams pay twice.

  1. Hard context-size limit in the router. Don't send a sub-4k prompt to a 200k-window model even if it's “the same Claude”. The mini variants are cheaper and equally effective for short prompts.
  2. Long-context-native models only when you must. Gemini with a 1M window makes sense for full codebase analysis, not for a 30-page PDF — that fits in 32k.
  3. Check whether the model is tier-priced. Some providers (Anthropic, Google) have a higher price bracket above a threshold (e.g. 200k tokens) — crossing it by 1 token roughly doubles the whole prompt's price.

Rule of thumb: if the median prompt in a call-path is < 8k tokens, a model with a ≤ 32k window is the right choice. Paying premium for “window just in case” is a classic trap.
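Both ideas in this rule, choosing a window class by prompt size and watching for tiered pricing, can be sketched as two small functions. The tier names, the 150k cut-off and the 2× multiplier are assumptions for illustration; the 8k and 200k figures come from the text above. Check your provider's actual price brackets before relying on any of them.

```python
def pick_window(prompt_tokens: int) -> str:
    """Choose the smallest context tier that fits the prompt (tier names are hypothetical)."""
    if prompt_tokens <= 8_000:
        return "short-context"      # e.g. a model with a <= 32k window
    if prompt_tokens <= 150_000:    # assumed cut-off, leaves headroom below 200k
        return "standard-context"
    return "long-context"


def input_cost(prompt_tokens: int, base_price: float,
               threshold: int = 200_000, multiplier: float = 2.0) -> float:
    """Input cost in USD when the whole prompt is billed at a higher rate above a threshold.

    base_price is USD per 1M input tokens; multiplier=2.0 is an assumed bracket jump.
    """
    price = base_price * multiplier if prompt_tokens > threshold else base_price
    return prompt_tokens / 1_000_000 * price


just_under = input_cost(200_000, base_price=3.0)
just_over = input_cost(200_001, base_price=3.0)
print(f"200,000 tokens: ${just_under:.2f}   200,001 tokens: ${just_over:.2f}")
```

Under these assumptions a single extra token moves the prompt from about $0.60 to about $1.20, which is why the router should trim or split prompts that sit just above a bracket boundary.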

Rule 4 — Embeddings classifier instead of an LLM-prompt router

The temptation: “I'll use an LLM to pick which LLM to send the prompt to.” Don't. Every LLM-router call adds 200–800 ms of latency and $0.001–0.005 per request — at scale it eats most of the savings.

Better: an embeddings-based classifier.

# Cheap, fast, deterministic router
# embed() is your embedding call; classifier is a fitted scikit-learn-style model
embedding = embed(prompt)                                # ~$0.00002, < 50ms
class_probs = classifier.predict_proba([embedding])[0]   # local sklearn / xgboost
predicted_class = classifier.classes_[class_probs.argmax()]
target_model = ROUTING_TABLE[predicted_class]

  • Embedding model: text-embedding-3-small from OpenAI, voyage-3-lite, or open-source bge-small. Cost: $0.01–0.02 per 1M tokens.
  • Classifier: plain logistic regression / XGBoost trained on 500–2000 labelled prompts. Trained on a laptop in < 1 hour.
  • Routing table: a static dict mapping classes → models.

Routing decision time: 40–80 ms, versus 200–800 ms for an LLM router, and at ~$0.00002 per request the embedding call costs 50–250× less than the $0.001–0.005 quoted above. It is also fully deterministic, which makes it easy to test and debug. For projects with no training set: start with a simple keyword/regex router, collect logs for two weeks, then train the embeddings classifier.
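The keyword/regex bootstrap router mentioned above might look like the following sketch. The patterns, the tier names and the log format are hypothetical; the point is only that the first version needs no training data and that every decision is logged for later labelling.

```python
import re

# Hypothetical patterns and model tiers; tune them against your own traffic.
RULES = [
    (re.compile(r"\b(contract|clause|liability|indemnif\w*)\b", re.I), "opus"),
    (re.compile(r"\b(summari[sz]e|explain|compare|refactor)\b", re.I), "sonnet"),
]
DEFAULT_MODEL = "haiku"


def keyword_route(prompt: str) -> str:
    """Return the tier of the first matching rule; fall back to the cheapest model."""
    for pattern, model in RULES:
        if pattern.search(prompt):
            return model
    return DEFAULT_MODEL


def log_decision(prompt: str, model: str, log: list) -> None:
    """Keep (prompt, model) pairs; once reviewed, they become classifier training data."""
    log.append({"prompt": prompt, "model": model})


decisions = []
for text in ["Tag this ticket: parcel arrived late",
             "Summarise this 20-page report",
             "Check the liability clause in section 4"]:
    log_decision(text, keyword_route(text), decisions)
```

Rules are evaluated in order, so the most expensive tier is listed first: a prompt that mentions both a contract and a summary is sent to the stronger model.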

Rule 5 — Prompt caching as a routing multiplier

Routing and caching are two orthogonal axes. Combined they produce a multiplicative effect — but only if your prompts have stable structure.

What to cache:

  • System prompt (typically 500–4000 tokens, identical across calls) — a cache hit drops its cost by ~90%.
  • Stable RAG context — docs, instructions, schemas.
  • Few-shot examples — especially when you reuse the same 5–10 examples across a whole task class.

Anthropic prompt caching: 5-minute TTL, cache write costs ~25% more than normal input, cache read is ~10% of the input price. Break-even at 2 calls inside the 5-min window.
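The break-even claim can be checked with a few lines of arithmetic. This sketch uses the multipliers quoted above (write at 1.25× the normal input price, read at 0.10×) and normalises the cost of the cached prefix to 1.0. It assumes every call after the first lands inside the TTL, which is a simplification.

```python
def cost_without_cache(calls: int, prefix_cost: float = 1.0) -> float:
    """Every call pays the full input price for the shared prefix."""
    return calls * prefix_cost


def cost_with_cache(calls: int, prefix_cost: float = 1.0,
                    write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """First call writes the cache; later calls (assumed inside the TTL) read it."""
    if calls == 0:
        return 0.0
    return prefix_cost * (write_mult + (calls - 1) * read_mult)


for n in (1, 2, 10):
    print(n, cost_without_cache(n), round(cost_with_cache(n), 2))
```

A single call is more expensive with caching (1.25 vs. 1.0). From the second call onwards caching is cheaper (1.35 vs. 2.0), and by ten calls the prefix costs 2.15 instead of 10.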

OpenAI automatic caching: kicks in when the prompt is at least 1,024 tokens long and its prefix repeats. ~50% discount on the cached portion. No action required, but the order of prompt elements matters: variable data has to go at the end.
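Because prefix-based caching only helps when consecutive prompts start identically, it is worth making the ordering explicit in code. The sketch below is one possible arrangement, with made-up prompt contents; `shared_prefix_len` is a helper written for this example to show how much of two prompts a prefix cache could reuse.

```python
# Stable blocks: identical across calls, so they belong at the front of the prompt.
SYSTEM_PROMPT = "You are a support assistant. Follow the policy below."
POLICY_DOC = "Refund policy v3: ..."
FEW_SHOT = "Example 1: ...\nExample 2: ..."


def build_prompt(user_message: str, customer_name: str) -> str:
    """Stable blocks first, per-request data last, so consecutive prompts share a prefix."""
    stable_prefix = "\n\n".join([SYSTEM_PROMPT, POLICY_DOC, FEW_SHOT])
    variable_suffix = f"Customer: {customer_name}\nMessage: {user_message}"
    return stable_prefix + "\n\n" + variable_suffix


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


p1 = build_prompt("Where is my order?", "Ana")
p2 = build_prompt("I want a refund", "Ben")
print(shared_prefix_len(p1, p2), "of", len(p1), "characters are shared")
```

Putting the customer name or a timestamp at the top instead would reduce the shared prefix to almost nothing and defeat the cache.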

Routing + caching combined, in practice:

No optimisation:    100% × $1.00 = $1.00 / 1M req
Routing only:       20% × $1.00 + 80% × $0.05 = $0.24 / 1M req
Routing + caching:  50% × (20% × $1.00 + 80% × $0.05) = $0.12 / 1M req   (assuming caching halves the blended cost)

That's 8× less than baseline, with no change to any prompt. Two infrastructure layers.

Rule 6 — Measure cost per useful output, not cost per token

The biggest FinOps mistake for AI: the $/1M tokens metric. On its own it tells you nothing. The right metric: cost per useful output (CPUO).

CPUO = (total LLM cost for the task) / (number of tasks completed with business success)

What “business success” means depends on the feature:

  • Classification → correct class validated downstream.
  • Sales-email generation → email actually sent (not killed by review).
  • Contract analysis → report accepted by counsel with no rewrites.
  • Coding assistant → code merged to master.

Why this matters: a cheap model that hallucinates 30% of the time is more expensive than an expensive model with a 2% error rate — once you count retries, human escalation and the opportunity cost of a bad business decision.

Concrete example — pricing assistant in a wholesale company:

| Model | Cost / call | Success rate | CPUO |
|---|---|---|---|
| GPT-4o-mini | $0.002 | 71% | $0.0028 |
| GPT-4o | $0.040 | 96% | $0.0417 |
| Cascade mini→4o | $0.008 | 96% | $0.0083 |

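The cascade matches GPT-4o's 96% success rate at a fifth of its CPUO ($0.0083 vs. $0.0417). On the bare formula, pure mini still has the lowest CPUO ($0.0028), but that figure ignores what its 29% failure rate costs downstream. Once a bad pricing recommendation costs more than about two cents in retries, human escalation or lost margin, the cascade is cheaper than pure mini as well.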
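The CPUO figures in the table follow directly from the formula, and the same function can be extended with a per-failure downstream cost. In this sketch the `failure_cost` parameter and the $0.05 value are illustrative assumptions, not numbers from the case.

```python
def cpuo(cost_per_call: float, success_rate: float, failure_cost: float = 0.0) -> float:
    """Cost per useful output.

    failure_cost is the assumed downstream cost (retry, human review, bad decision)
    of one failed call; 0.0 reproduces the plain formula.
    """
    cost_per_attempt = cost_per_call + (1 - success_rate) * failure_cost
    return cost_per_attempt / success_rate


# (cost per call in USD, success rate) taken from the table above.
rows = {
    "GPT-4o-mini": (0.002, 0.71),
    "GPT-4o": (0.040, 0.96),
    "Cascade mini->4o": (0.008, 0.96),
}

for name, (cost, rate) in rows.items():
    plain = cpuo(cost, rate)
    loaded = cpuo(cost, rate, failure_cost=0.05)   # hypothetical $0.05 per failure
    print(f"{name:17s} plain={plain:.4f}  with failure cost={loaded:.4f}")
```

With no failure cost the mini model has the lowest CPUO. At the assumed $0.05 per failure, the ranking flips and the cascade becomes the cheapest of the three.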

Model map — what to replace GPT-4o with in 2026

Reference table (verify current prices — they shift every quarter):

| Task | Anti-pattern (expensive) | Smart default (10–30× cheaper) | Premium (when required) |
|---|---|---|---|
| Email / short-doc summarisation | GPT-4o, Opus | Haiku 4.5, GPT-4o-mini, Gemini Flash | |
| Classification / tagging | GPT-4o | Haiku, Gemini Flash, fine-tuned Llama 8B | |
| Structured-input data extraction | GPT-4o, Opus | Haiku, GPT-4o-mini with JSON mode | Sonnet (noisy data) |
| Internal-RAG Q&A | GPT-4o | Sonnet 4.6, Gemini 2.x Pro | Opus (legal/medical) |
| Boilerplate code generation | GPT-4o, Opus | Sonnet 4.6, GPT-4o-mini | Opus (architecture) |
| Multi-step agent (5+ tool calls) | | | Opus 4.7, GPT-flagship |
| Translation | GPT-4o | Haiku, DeepL API, NLLB self-hosted | |
| OCR + invoice extraction | Full multimodal GPT-4o | GPT-4o-mini multimodal, Gemini Flash | |

2026 price rule of thumb: flagship models run roughly $3–15 / 1M input tokens, mid-tier $0.50–3, the cheapest $0.05–0.30. A 30–100× spread — at a million requests per day, that's the difference between $1k and $100k per month.
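To see how the $1k versus $100k gap arises, here is the arithmetic as a small function. The workload of roughly 700 input tokens per request is an assumption picked for illustration, and the two prices are sample points from the ranges above; output tokens are ignored.

```python
def monthly_input_spend(requests_per_day: int, input_tokens_per_request: int,
                        price_per_m_tokens: float, days: int = 30) -> float:
    """Monthly input-token spend in USD."""
    tokens = requests_per_day * input_tokens_per_request * days
    return tokens / 1_000_000 * price_per_m_tokens


# Assumed workload: 1M requests/day at ~700 input tokens each (hypothetical figure).
cheap = monthly_input_spend(1_000_000, 700, 0.05)      # cheapest tier
flagship = monthly_input_spend(1_000_000, 700, 5.00)   # flagship tier
print(f"cheapest tier: ${cheap:,.0f}/month   flagship: ${flagship:,.0f}/month")
```

Under these assumptions the same traffic costs about $1,050 per month on the cheapest tier and about $105,000 on a flagship model.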

Implementation stack — what to pick

Three realistic options, in order of increasing control:

1. OpenRouter — fastest start

  • 200+ models under one OpenAI-SDK-compatible API.
  • Manual routing (you pick the model) + automatic provider-down fallback.
  • ~5% markup over provider price — sometimes cheaper than direct (better rate cards).
  • No built-in classifier — you ship the decision layer.

When: prototype, small team, you want to test multiple models without writing integrations.

2. LiteLLM — open-source, self-hosted proxy

  • OpenAI-API-compatible proxy server.
  • Built-in router: cost-based, latency-based, load-balancing, fallback chains.
  • Cache (Redis), per-key rate limiting, observability (Langfuse, OpenTelemetry).
  • YAML config — fully declarative.

When: you have a DevOps team, want full control over routing and logs, don't want to hand metadata to a third-party platform.

3. Portkey / Helicone / Vellum — managed gateway

  • The same gateway role as LiteLLM, delivered as SaaS, with UI, model A/B tests, prompt versioning, guardrails.
  • You pay for convenience and visibility — typically $0.001–0.01 per request.
  • Best for product / non-tech teams that want to change routing without a deploy.

When: mature product, lots of model experiments, need for audit/compliance.

Our take

For most B2B SaaS teams: start with LiteLLM proxy in Docker — 2 hours of DevOps work — add a simple embeddings classifier, wire Langfuse for logs. Full control, zero vendor lock-in, savings visible in month one.

Most common mistakes when rolling out model routing

  1. Routing without output validation. Cheap-first cascade only works if you can detect failure. Without it, quality silently degrades.
  2. Routing with an LLM. “Let GPT-4o decide which model to use” — adds cost and latency, is non-deterministic. Embeddings + classifier always wins.
  3. No per-feature accounting. One shared “AI” budget hides everything. Each feature/endpoint should be its own cost centre in logs.
  4. Optimising the invoice, not CPUO. A cheap model with a 40% error rate isn't cheaper; its real cost is just hidden in retries, support tickets and churn.
  5. Routing without observability. Without logging {model_used, tokens_in, tokens_out, latency, validation_result} per request you can't tune. Langfuse, Helicone, OpenTelemetry — pick one, but log.
  6. Skipping caching. In the Rule 5 example, routing cuts cost roughly 4× and caching another 2×, about 8× combined. Doing only one of the two leaves money on the table.
  7. Leaving routing as a “TODO”. Rollout takes 1–2 days, ROI lands in week one. Every day you wait burns money in proportion to your traffic.
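Mistakes 3 and 5 share one remedy: a structured record per request, tagged with the feature it belongs to. The sketch below shows one possible shape using only the standard library. The field names follow the list above, while `feature` and `cost_usd` are additions assumed here to make per-feature accounting possible.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass
class LLMCallRecord:
    feature: str              # cost centre, e.g. the endpoint name (assumed field)
    model_used: str
    tokens_in: int
    tokens_out: int
    latency_ms: float
    validation_result: bool
    cost_usd: float           # computed by the caller from its own price table (assumed field)


def to_log_line(record: LLMCallRecord) -> str:
    """One JSON object per line, which most log backends can ingest."""
    return json.dumps(asdict(record), sort_keys=True)


def cost_by_feature(records) -> dict:
    """Sum spend per feature so each endpoint is its own cost centre."""
    totals = defaultdict(float)
    for r in records:
        totals[r.feature] += r.cost_usd
    return dict(totals)


records = [
    LLMCallRecord("tagging", "haiku", 300, 20, 120.0, True, 0.0001),
    LLMCallRecord("tagging", "sonnet", 300, 25, 480.0, True, 0.0012),
    LLMCallRecord("contracts", "opus", 9000, 800, 5200.0, True, 0.19),
]
print(to_log_line(records[0]))
print(cost_by_feature(records))
```

The same records feed the CPUO calculation: filter by `validation_result` and by downstream success to get the denominator.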

FAQ

Does LLM model routing work for small projects too?

Yes — the break-even is low. At $200–500/mo on LLMs, rolling out a basic Haiku → Sonnet router pays back in the first week. The bigger the traffic, the bigger the absolute gain, but the percentage saving is similar at any scale.

Isn't fine-tuning a small open-source model better than routing?

They're complementary, not competing. A fine-tuned Llama 8B / Mistral 7B on a specific task is often better and cheaper than any commercial model for that one task. Treat it as an addition to the router: fine-tune for narrow, high-volume task classes; use the commercial cascade for the rest.

Doesn't routing increase the risk of quality loss?

Only if deployed without metrics. With proper validation + CPUO monitoring routing actually improves quality — because you see which task classes really need the flagship and which don't. Most teams discover they had errors on the flagship too that they simply weren't measuring.

How long does it take to ship routing to production?

MVP version — LiteLLM proxy + static routing table by endpoint — 1–2 days for one engineer. Mature version with embeddings classifier, caching and full observability — 1–2 weeks. ROI typically in month one.

Can I start with a single provider (e.g. only Anthropic) instead of multi-vendor?

Yes — and that's often a good start. A pure Anthropic Haiku 4.5 → Sonnet 4.6 → Opus 4.7 cascade delivers 70–80% of the total savings available. Multi-vendor adds another ~10–20% and outage resilience, but complicates compliance and logging.

How do I know our team already has an AI cost problem?

Three signals: (1) the OpenAI / Anthropic / Google bill is growing faster than traffic, (2) > 60% of calls go to a single flagship model, (3) nobody on the team can tell you from memory what one end-user task costs. If you tick two of three, routing will pay back instantly.

Wrap-up

Model routing isn't a micro-optimisation — it's an architectural decision that in 2026 separates AI companies with a margin from companies subsidising OpenAI with their own runway. Six rules from this piece:

  1. Classify tasks (L1/L2/L3), not features.
  2. Cheap-first cascade with validated escalation.
  3. Route on context length, know provider price thresholds.
  4. Embeddings classifier, not an LLM router.
  5. Caching as multiplier — Anthropic 90% discount, OpenAI auto 50%.
  6. Measure cost per useful output, not per token.

Each rule on its own buys you 20–40% savings. Combined — 60–80%. Rollout: days, not months. Stack: LiteLLM + embeddings classifier + Langfuse. Vendor lock-in: zero.

Next step: list the 20 most frequent LLM calls in your system, assign a complexity class, and compare current cost vs. post-routing cost. If you'd rather have us do it with you — book an AI cost audit. The first diagnostic call is free; the report with concrete numbers and a rollout plan lands in 2 weeks.

— Andrzej Datta, floomi. Questions, comments, your own cases: hello@usefloomi.com.