The problem with a single provider
Most AI applications start the same way: pick a model, hard-code the API call, ship it. That works fine until the model you chose gets deprecated, its price doubles, a competitor releases something significantly better, or it goes down at the worst possible moment. Any of these events requires a code change across every place you called the API directly.
We learned this lesson early and now every product we build — whether for our own portfolio or for clients — routes LLM calls through a single gateway. The application code sees one endpoint. The gateway decides which provider gets the request.
How the gateway works
We self-host LiteLLM as the primary gateway. It presents an OpenAI-compatible API so the application code never changes — you call POST /chat/completions with a model name, and LiteLLM translates that into the right provider API call.
For products that need access to models outside LiteLLM’s direct integrations, or where we want managed fallback and routing, we add OpenRouter as a second layer. This is the setup behind BrightPath, where different lesson types use different models depending on the required reasoning depth and cost envelope.
The gateway gives us:
- Request logging. Every call, latency, token count, and cost is recorded. This is how we catch a prompt change that accidentally doubled token usage.
- Rate limit handling. The gateway retries on 429s and switches providers on hard limits. The application sees a result, not an error.
- Cost tracking per project. We tag requests with a project ID and generate weekly cost reports. Without this, costs drift invisibly.
- Zero-downtime model upgrades. When Anthropic released Claude 3.5 Sonnet, we changed one config value. No deploys, no PR reviews, no regression testing of API call sites.
Our model routing rules
Not all tasks need the same model. We route based on three criteria: reasoning requirement, latency tolerance, and cost per call.
Claude Sonnet — reasoning and long context
We default to Claude for tasks that require sustained reasoning, careful instruction following, or long context windows. This includes: writing first drafts from complex briefs, tutoring conversations in BrightPath, contract and document analysis, and any task where a wrong output has meaningful downstream consequences.
Claude is our most expensive default, but the quality floor is higher than the alternatives on tasks where the prompt is complicated and the output needs to be consistent.
Gemini Flash — high-volume, fast, cheap
For high-throughput tasks where latency matters and cost compounds fast — real-time classification, suggestion generation, short completions — Gemini Flash is usually the right call. It’s significantly cheaper per token than Claude Sonnet and fast enough for interactive use. BrightPath uses it for inline hint generation during lessons: needs to be instant, doesn’t need to be brilliant.
GPT-4o — when client infrastructure requires it
We occasionally default to GPT-4o on client engagements where the client already has Azure OpenAI credits, enterprise agreements, or compliance requirements that tie them to Microsoft infrastructure. The gateway makes this a config change, not a rebuild.
When to introduce routing
You don’t need a gateway on day one. If you have one model doing one task, a direct API call is fine. Introduce a gateway when:
- You have more than two distinct AI workflows with different requirements
- You’re spending more than $500/month on model API calls
- You need request logging for debugging or compliance
- You want to experiment with model switching without code changes
Setting up LiteLLM takes about a day. We include it as standard in our AI infrastructure engagements.
One thing people get wrong
Teams often try to pick “the best model” once and use it everywhere. There is no best model — there is the right model for a given task at a given cost point. The goal of a routing layer is to match tasks to models correctly, not to standardise on a single provider out of preference or familiarity.
If you’re running the same frontier model for both a 10,000-token document analysis and a two-sentence category label, you’re probably overpaying on one and correctly spending on the other.