Why AI features fail in production
Most AI feature failures aren’t model failures. The models are good. The failures happen in the surrounding infrastructure: no retry logic when the provider times out, prompts that drift without version control, costs that compound silently, no graceful degradation, no eval pipeline to catch regressions before they hit users.
We’ve run 35 products on the same AI stack. We’ve hit these failure modes ourselves. The infrastructure we build for clients is the same layer we self-host for our own portfolio.
What we build
LLM gateway
A self-hosted gateway (LiteLLM) that sits between your application and every model provider. One endpoint, one API key management surface, one place to configure routing rules, failover, rate limiting, spend caps, and logging. When Anthropic releases a better model, you change a config value — not ten call sites.
- Route across Claude, Gemini, OpenAI, and open-weights models
- Automatic failover with configurable retry logic
- Per-tenant, per-feature spend caps enforced at the gateway, before a request reaches a provider
- Request logging: prompt, response, tokens, latency, cost, model, tenant
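As a rough illustration of the failover, retry, spend-cap, and logging behaviour listed above, here is a minimal in-memory sketch in Python. It is not LiteLLM's API: `Gateway`, `ProviderError`, `SpendCapExceeded`, and the provider callables are hypothetical names chosen for this example, and costs are assumed to be returned by the provider call.

```python
from dataclasses import dataclass, field
from typing import Callable


class ProviderError(Exception):
    """Stands in for a timeout or 5xx response from a model provider."""


class SpendCapExceeded(Exception):
    """Raised when a tenant has already used up its budget."""


# Hypothetical provider interface: prompt in, (text, cost in USD) out.
Provider = Callable[[str], tuple[str, float]]


@dataclass
class Gateway:
    providers: list[tuple[str, Provider]]  # tried in order: primary first
    caps_usd: dict[str, float]             # tenant -> spend cap
    retries_per_provider: int = 2
    spent_usd: dict[str, float] = field(default_factory=dict)
    log: list[dict] = field(default_factory=list)

    def complete(self, tenant: str, prompt: str) -> str:
        spent = self.spent_usd.get(tenant, 0.0)
        # Tenants without a configured cap are blocked (cap defaults to 0).
        if spent >= self.caps_usd.get(tenant, 0.0):
            raise SpendCapExceeded(tenant)
        for name, call in self.providers:
            for attempt in range(1, self.retries_per_provider + 1):
                try:
                    text, cost = call(prompt)
                except ProviderError:
                    continue  # retry, then fall through to the next provider
                self.spent_usd[tenant] = spent + cost
                self.log.append({"tenant": tenant, "provider": name,
                                 "attempt": attempt, "cost_usd": cost})
                return text
        raise ProviderError("all providers failed")
```

The application only ever calls `complete`; which provider answered, and after how many attempts, is visible in the log rather than in the call sites.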
Prompt versioning and CI
Prompts are code. They should be versioned, reviewed, and tested before they go to production. We set up a prompt store with version history and a CI step that runs your eval set on every prompt change — the same way unit tests run on every code change.
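One way to picture a prompt store with version history is the sketch below. It is an in-memory illustration only; `PromptStore`, `PromptVersion`, and the choice of a SHA-256 content hash to detect changes are assumptions made for this example, not a description of a specific tool.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: int    # 1-based, increases on every real change
    template: str
    sha256: str     # content hash, useful for review and audit


@dataclass
class PromptStore:
    _history: dict[str, list[PromptVersion]] = field(default_factory=dict)

    def publish(self, name: str, template: str) -> PromptVersion:
        digest = hashlib.sha256(template.encode("utf-8")).hexdigest()
        versions = self._history.setdefault(name, [])
        if versions and versions[-1].sha256 == digest:
            return versions[-1]  # unchanged text does not create a version
        new = PromptVersion(name, len(versions) + 1, template, digest)
        versions.append(new)
        return new

    def get(self, name: str, version: Optional[int] = None) -> PromptVersion:
        versions = self._history[name]
        return versions[-1] if version is None else versions[version - 1]
```

Because old versions stay addressable, a CI step can run the eval set against both the current and the proposed version and compare the results.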
Eval pipeline
An eval pipeline runs representative inputs through your LLM feature and scores the outputs before each deploy. It catches regressions in LLM behaviour that don’t show up in code tests. We help you build the golden dataset (typically 30–100 examples from real production data) and wire it into your CI pipeline.
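A minimal sketch of such a gate, under simplifying assumptions: `GoldenExample`, `score`, and `run_eval` are hypothetical names, the LLM feature is any function from input text to output text, and scoring is a plain substring check. Real evals often use richer scorers, but the pass/fail shape is the same.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenExample:
    input: str
    must_contain: tuple[str, ...]  # phrases a correct output should include


def score(output: str, example: GoldenExample) -> float:
    """Fraction of required phrases found in the output (case-insensitive)."""
    if not example.must_contain:
        return 1.0
    lowered = output.lower()
    hits = sum(1 for phrase in example.must_contain if phrase.lower() in lowered)
    return hits / len(example.must_contain)


def run_eval(feature: Callable[[str], str],
             dataset: list[GoldenExample],
             threshold: float = 0.9) -> tuple[bool, float]:
    """Run every golden example and compare the mean score to the threshold."""
    scores = [score(feature(example.input), example) for example in dataset]
    mean = sum(scores) / len(scores)
    return mean >= threshold, mean
```

In CI, the boolean would typically decide the exit code of the eval step, so a prompt change that lowers the mean score below the threshold blocks the deploy.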
Queue workers and async processing
LLM calls that take more than two seconds should not be on the HTTP request path. We build async queue infrastructure (Bull on Redis, or n8n for orchestration-heavy workflows) that decouples LLM processing from user-facing response times and handles retry, dead-letter, and priority at the queue level rather than in your request handlers.
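To make the retry, dead-letter, and priority semantics concrete, here is an in-memory stand-in written in Python. It is not Bull or Redis: `InMemoryQueue` and `Job` are hypothetical names, everything runs in one process, and a production worker would also wait (back off) between attempts.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Callable


@dataclass(order=True)
class Job:
    priority: int                       # lower number = processed first
    seq: int                            # tie-breaker: FIFO within a priority
    payload: str = field(compare=False)
    attempts: int = field(default=0, compare=False)


class InMemoryQueue:
    def __init__(self, max_attempts: int = 3) -> None:
        self._heap: list[Job] = []
        self._seq = itertools.count()
        self.max_attempts = max_attempts
        self.dead_letter: list[Job] = []   # jobs that exhausted their retries
        self.results: dict[str, str] = {}

    def enqueue(self, payload: str, priority: int = 10) -> None:
        heapq.heappush(self._heap, Job(priority, next(self._seq), payload))

    def drain(self, handler: Callable[[str], str]) -> None:
        """Process jobs until the queue is empty, retrying failures."""
        while self._heap:
            job = heapq.heappop(self._heap)
            job.attempts += 1
            try:
                self.results[job.payload] = handler(job.payload)
            except Exception:
                if job.attempts >= self.max_attempts:
                    self.dead_letter.append(job)
                else:
                    job.seq = next(self._seq)  # requeue behind its peers
                    heapq.heappush(self._heap, job)
```

The HTTP handler's only job is to call `enqueue` and return; the slow LLM call happens in whatever process runs `drain`.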
Billing and usage metering
If you charge users per token, per call, or per successful action, the metering layer needs to be accurate, auditable, and integrated with your billing provider (Stripe) before you launch — not after the first billing dispute. We build the metering layer as a first-class component.
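The sketch below shows one possible shape for an accurate and auditable meter: exact decimal arithmetic and idempotent event recording, so a retried request is never billed twice. All names (`UsageEvent`, `Meter`, `PRICES_PER_MILLION`) are hypothetical, the prices are made-up placeholders rather than any provider's real rates, and reporting totals to a billing provider is left out.

```python
from dataclasses import dataclass
from decimal import Decimal

# Placeholder prices in USD per 1M tokens: (input, output). Not real rates.
PRICES_PER_MILLION = {"model-a": (Decimal("3.00"), Decimal("15.00"))}


@dataclass(frozen=True)
class UsageEvent:
    event_id: str        # idempotency key, e.g. the request ID
    tenant: str
    model: str
    input_tokens: int
    output_tokens: int


class Meter:
    def __init__(self) -> None:
        self._events: dict[str, UsageEvent] = {}

    def record(self, event: UsageEvent) -> bool:
        """Store the event once; return False if it was already recorded."""
        if event.event_id in self._events:
            return False
        self._events[event.event_id] = event
        return True

    def total_usd(self, tenant: str) -> Decimal:
        total = Decimal("0")
        for event in self._events.values():
            if event.tenant != tenant:
                continue
            in_price, out_price = PRICES_PER_MILLION[event.model]
            total += (in_price * event.input_tokens
                      + out_price * event.output_tokens) / Decimal(1_000_000)
        return total
```

Keeping the raw events, rather than only running totals, is what makes a disputed invoice traceable back to individual calls.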
Observability
Traces, token counts, latencies, and costs — per feature, per tenant, per model, per time window. We instrument every LLM call and route telemetry to your existing observability stack or help you stand one up. Cost-per-action dashboards are a standard deliverable.
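As an illustration of the per-feature, per-tenant, per-model breakdown, the following sketch groups call records and derives a cost-per-call figure. `CallRecord` and `summarise` are hypothetical names; in practice these records would come from the telemetry emitted around each LLM call rather than from an in-memory list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    feature: str
    tenant: str
    model: str
    latency_ms: float
    tokens: int
    cost_usd: float


def summarise(records: list[CallRecord]) -> dict[tuple[str, str, str], dict]:
    """Aggregate call records by (feature, tenant, model)."""
    groups: dict[tuple[str, str, str], list[CallRecord]] = defaultdict(list)
    for record in records:
        groups[(record.feature, record.tenant, record.model)].append(record)

    summary = {}
    for key, group in groups.items():
        summary[key] = {
            "calls": len(group),
            "tokens": sum(r.tokens for r in group),
            "max_latency_ms": max(r.latency_ms for r in group),
            "cost_per_call_usd": sum(r.cost_usd for r in group) / len(group),
        }
    return summary
```

Adding a time bucket to the grouping key would give the per-time-window view; it is omitted here to keep the example short.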
How the engagement works
Infrastructure engagements run 3–8 weeks and are typically delivered as building blocks your team owns and operates after handover. We don’t create vendor lock-in to our own tooling — we use open-source components (LiteLLM, n8n, Postgres) that you can run, fork, and modify independently.
At the end of the engagement you receive running infrastructure, documentation, runbooks, and a handover session with your team.
Frequently asked questions
Do we need a gateway if we only use one model provider?
Yes — for spend caps, logging, and fallback handling alone. Provider outages happen. Spending overruns happen. Having no central logging layer means you’re flying blind on cost and reliability.
What if we already have some infrastructure in place?
We audit what you have, identify the gaps, and fill them. We don’t tear down working infrastructure for the sake of using our preferred tools.
Can you work with our existing cloud provider?
Yes. We’re cloud-agnostic. Docker-based deployments run on AWS, GCP, Azure, Fly.io, Hetzner, or bare metal. We default to whatever minimises operational overhead for your team.
How do we handle sensitive data in LLM calls?
The gateway layer is the right place to implement PII redaction before calls leave your infrastructure. We can build a redaction step into the gateway that strips or masks sensitive fields before they reach any external model provider.
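A redaction step of this kind could look like the sketch below. It is deliberately simple: the two regular expressions only catch email addresses and card-like digit runs, `redact_text` and `mask_fields` are hypothetical names, and pattern matching alone should not be treated as a complete PII detector.

```python
import re

# Illustrative patterns only; real deployments need broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact_text(text: str) -> str:
    """Mask emails and card-like numbers in free text."""
    text = EMAIL.sub("[EMAIL]", text)
    return CARD.sub("[CARD]", text)


def mask_fields(payload: dict, sensitive: set[str]) -> dict:
    """Return a copy of the payload with named fields masked."""
    return {key: "[REDACTED]" if key in sensitive else value
            for key, value in payload.items()}
```

Running both functions inside the gateway means every call site gets the same redaction rules, and the rules can be audited in one place.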