Self-hosting LiteLLM: 6 months in production

The moment you add a second LLM provider to your stack – maybe Anthropic for long-context tasks, Google for multimodal, or OpenAI for general-purpose – you've introduced a significant new operational headache. Suddenly, your direct API calls are tightly coupled to a single vendor's API, rate limits, and pricing model. We've seen teams hard-code provider-specific logic across their codebase, leading to brittle systems that are a nightmare to debug or switch. After six months of self-hosting LiteLLM across projects like Email Triage and Ghost Writer, we can confidently say: abstracting your LLM calls behind a unified gateway isn't optional, it's essential for any serious AI product.

The Common Wrong Approach

Most teams start simple: direct API calls to api.anthropic.com or api.openai.com. It's fast to get a proof-of-concept running. You pull in the official client library, set an API key, and you're making requests. This approach seems reasonable until you need to add another provider, implement intelligent fallbacks when one service is slow, or track token usage consistently across different billing models. We've seen if/else spaghetti grow around provider_name variables, each branch handling specific API quirks, retry logic, and error codes. This leads to vendor lock-in by inertia, makes cost optimization through dynamic model switching almost impossible, and turns observability into a bespoke integration nightmare for every single LLM endpoint.

The Better Approach

Our recommendation, refined over half a year of production use, is to self-host LiteLLM as a unified API gateway. We deploy it as a Docker container, often within a Kubernetes cluster for resilience, exposing a single endpoint to our application services. Instead of direct calls, our code targets this LiteLLM gateway, treating all models as if they were behind a single, consistent OpenAI-compatible API.

The core of our setup is a config.yaml file. Here, we map logical model names like fast-triage-model or creative-writer-model to specific provider models (e.g., anthropic/claude-3-haiku-20240307 or openai/gpt-4o). This allows our application code to request fast-triage-model without knowing or caring which underlying provider is serving it. We manage API keys securely as environment variables, passed directly to the LiteLLM container.

LiteLLM handles critical infrastructure concerns out of the box:

Unified API: A single completion interface abstracts away provider-specific nuances.
Automatic Retries & Fallbacks: Configure max_retries and fallback_models directly in your config.yaml. If gpt-4o hits a rate limit, LiteLLM can automatically try claude-3-opus-20240229. This is invaluable for Email Triage, ensuring no customer request gets dropped.
Token Counting & Cost Tracking: LiteLLM normalizes token counts and provides cost estimates, giving us a single source of truth for spend across all providers.
Caching: We leverage LiteLLM's built-in caching for common requests, reducing latency and API costs for applications like BrightPath where certain prompts are highly repeatable.

This abstraction lets us dynamically route traffic based on cost, performance, or even specific model capabilities without changing application code. For Ghost Writer, we can send initial draft requests to a cheaper, faster model like gpt-3.5-turbo, then route final polish passes to claude-3-opus-20240229, all configured at the gateway level. If you're looking to unify your LLM infrastructure, consider Dainty's expertise to start a project and implement a robust gateway solution.

Where This Breaks

While LiteLLM is powerful, it's not a silver bullet. The biggest drawback is operational overhead. You're now running another critical service that needs monitoring, scaling, and patching. This adds complexity that a small team using only one or two models from a single provider might not justify. We've also encountered situations where LiteLLM's integration with a brand new LLM provider had subtle bugs or missing features that required workarounds or waiting for upstream fixes. Debugging can be trickier, as you're adding another layer between your app and the LLM API. While litellm.set_verbose(True) helps, understanding network issues or provider-specific errors can sometimes require direct API calls to isolate. Furthermore, while LiteLLM offers basic caching and observability hooks, it's not a full-fledged monitoring or cost management platform. You'll still need to integrate its logs and metrics into your existing infrastructure. Don't expect it to replace your Datadog or Prometheus setup.

Practical Next Step

If you're managing multiple LLM providers or anticipate doing so, dedicate an afternoon to a LiteLLM proof-of-concept. Start by pulling the LiteLLM Docker image: docker pull ghcr.io/berriai/litellm. Next, create a simple config.yaml to define two custom models, one mapping to OpenAI's gpt-4o and another to Anthropic's claude-3-opus-20240229. Define environment variables for your API keys (OPENAI_API_KEY, ANTHROPIC_API_KEY). Then, run the container: docker run -p 4000:4000 -v ./config.yaml:/app/config.yaml -e OPENAI_API_KEY=$OPENAI_API_KEY -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY ghcr.io/berriai/litellm. Finally, from your application, send a simple completion request to http://localhost:4000/chat/completions, swapping the model parameter between your custom names. Experiment with fallback_models in your config.yaml to see the resilience in action. This minimal setup will quickly demonstrate the power of a unified gateway.

We build production AI, not prototypes. If you're looking to ship something like what's described here — see how we work or start a project brief →