How We Evaluate LLMs: Beyond Benchmarks

A robust LLM evaluation, from test set creation to a confident model decision, typically takes 1-2 weeks for a well-defined task. This structured approach helps you pick the right model for your specific needs and avoid costly re-engineering months later, when initial assumptions fail in production. The alternative is usually a gut feeling or reliance on public benchmarks that rarely reflect your use case.

Most teams start LLM projects by hard-coding calls to api.anthropic.com or api.openai.com. They pick a model like gpt-4o or claude-3-opus-20240229 because it's the "best" or "newest." Then they run a few manual tests in a notebook or playground. If it looks okay, they ship it. This seems reasonable initially. It gets you to a proof of concept fast. The problem is, "best" is subjective and often doesn't mean "best for your specific task at your specific price point." We've seen clients commit to one model, only to realize six months later that a cheaper, faster model could achieve 95% of the quality for 10% of the cost. The refactor then becomes a painful, low-priority engineering task.

The Dainty Evaluation Framework

We use a structured framework whenever we need to select an LLM for a client's specific task. It's how we made decisions for projects like our Email Triage system and the Ghost Writer content generation service. This isn't about theoretical perfection; it’s about making a data-driven choice that holds up in production.

Here’s our process:

Define the Task & Success Metrics: Before touching any model, we define what "good" means. For Email Triage, "good" meant accurately classifying emails into one of five categories with high precision and recall, at a latency low enough for a smooth user experience. For Ghost Writer, it meant generating human-like, grammatically correct content that adheres to specific brand guidelines. We set clear, measurable metrics: accuracy, F1 score, specific error types to avoid, average token count, and latency targets.
Build a Representative Test Set: This is non-negotiable. We create a JSONL or CSV file with 100-300 diverse examples of inputs and their ideal outputs. This set must cover common cases, edge cases, adversarial inputs, and data distributions expected in production. For Email Triage, this included short spam, long customer support queries, and internal memos. We don't just pull from the happy path; we actively seek out failure modes.
Automate Evaluation & Scoring: We write a Python script that iterates through the test set, sends each input to a list of candidate LLMs (e.g., gpt-4o, claude-3-sonnet-20240229, mixtral-8x7b-instruct-v0.1 via a provider like Together.ai), and captures the output. We then use a combination of programmatic checks (e.g., regex for output format, keyword presence) and a "scoring LLM" (often gpt-4o or claude-3-opus) to score the output against our defined metrics. For nuanced tasks, a small human review loop is integrated for a subset of results. Each run logs latency, token usage, and the computed quality score. Tools like Weights & Biases or a custom logging solution help track these metrics over time.
Calculate Cost-per-Quality-Point: This is where the rubber meets the road. We take the total cost of processing the entire test set (input tokens plus output tokens, each priced at the provider's rate) and divide it by the aggregate quality score. A cheaper model with slightly lower quality might be a better choice if its cost-per-quality-point is significantly lower. For example, if claude-3-haiku-20240307 delivers 90% of the quality of gpt-4o at 1/20th the cost, its cost-per-quality-point is 0.05 / 0.9 ≈ 0.056 of gpt-4o's, roughly 18 times lower, which will likely win for many high-volume applications.
Decision and Iteration: With these metrics, we make a clear decision. We don't just pick the highest quality; we pick the one that best balances quality, cost, and latency for the specific business objective. This evaluation isn't static. As models evolve (e.g., new versions released every few months in 2026), or our task requirements shift, we rerun this framework. It provides a repeatable, defensible process.
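
The automated scoring and cost-per-quality-point steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a definitive harness: the `Candidate` wrapper, the `stub_model` function, the five category labels, and the per-million-token prices are all hypothetical placeholders, and word counts stand in for real token counts. In practice `call` would wrap a provider SDK and return the token usage the API reports.

```python
import re
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModelResult:
    text: str
    input_tokens: int
    output_tokens: int


@dataclass
class Candidate:
    name: str
    call: Callable[[str], ModelResult]  # hypothetical wrapper around a provider SDK
    usd_per_1m_input: float             # placeholder price, not a real rate
    usd_per_1m_output: float            # placeholder price, not a real rate


# Hypothetical label set for a five-category classification task.
LABEL_RE = re.compile(r"^(spam|support|internal|sales|other)$")


def score_output(output: str, expected: str) -> float:
    """Programmatic check: the output must be a bare, valid label that matches."""
    label = output.strip().lower()
    if not LABEL_RE.match(label):
        return 0.0  # wrong format counts as a failure
    return 1.0 if label == expected else 0.0


def evaluate(candidate: Candidate, test_set: list[dict[str, str]]) -> dict:
    """Run one candidate over the test set and log quality, cost, and latency."""
    total_score = 0.0
    total_cost = 0.0
    latencies = []
    for example in test_set:
        start = time.perf_counter()
        result = candidate.call(example["input"])
        latencies.append(time.perf_counter() - start)
        total_score += score_output(result.text, example["expected"])
        total_cost += (
            result.input_tokens * candidate.usd_per_1m_input
            + result.output_tokens * candidate.usd_per_1m_output
        ) / 1_000_000
    n = len(test_set)
    return {
        "model": candidate.name,
        "quality": total_score / n,
        "cost_usd": total_cost,
        "avg_latency_s": sum(latencies) / n,
        # Total cost divided by the aggregate quality score.
        "cost_per_quality_point": total_cost / total_score if total_score else float("inf"),
    }


def stub_model(prompt: str) -> ModelResult:
    """Stand-in for a real API call; always answers 'support'.
    Word count is a crude proxy for input tokens."""
    return ModelResult(text="support", input_tokens=len(prompt.split()), output_tokens=1)


TEST_SET = [
    {"input": "My invoice is wrong, please help", "expected": "support"},
    {"input": "WIN A FREE CRUISE NOW", "expected": "spam"},
]

if __name__ == "__main__":
    candidate = Candidate("stub-model", stub_model, usd_per_1m_input=1.0, usd_per_1m_output=2.0)
    print(evaluate(candidate, TEST_SET))
```

Running `evaluate` once per candidate yields comparable rows of quality, cost, latency, and cost-per-quality-point. For nuanced tasks, a scoring LLM or a human review pass would replace or supplement `score_output`.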

Where This Approach Breaks

While robust, this framework isn't always the right fit. If your LLM integration is a trivial, one-off script that runs daily on 10 inputs, investing a week in this evaluation is overkill. For tasks where "good enough" means 80% accuracy and the cost is negligible, manual spot-checking might suffice. This framework also adds complexity if your task definition or requirements are highly volatile and change weekly. Maintaining a representative test set and an automated evaluation pipeline requires discipline. If you're a small team building your first AI feature, a simplified version of this might be a better starting point. However, for any core feature that impacts customer experience or significant operational cost, this rigor pays dividends.

Practical Next Step

This week, pick one critical LLM-powered feature you're building or considering. Define what "success" looks like for that feature in concrete, measurable terms. Then, start building a small, representative test set – 20-50 examples of inputs and their ideal outputs. Focus on diversity, not just happy paths. Put this test set into a JSONL file. Even without a full automated pipeline, having this ground truth will immediately make your model selection and iteration process more objective. If you find yourself struggling to define "good" or build a diverse test set, that's often a sign that the problem itself isn't well-defined yet. If you want to implement this kind of system or need help defining your AI strategy, we regularly help clients architect these evaluation pipelines.
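
A starter test set like the one described above could be stored and checked along these lines. This is a sketch under assumptions: the `input`, `expected`, and `tag` field names are a hypothetical schema chosen for illustration, and the three examples are invented. One JSON object per line is the only thing the JSONL format itself requires.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical schema: every example needs an input and its ideal output.
REQUIRED_KEYS = {"input", "expected"}

# Invented examples; "tag" is an optional, hypothetical field for tracking diversity.
EXAMPLES = [
    {"input": "My invoice is wrong, please help", "expected": "support", "tag": "common"},
    {"input": "WIN A FREE CRUISE NOW", "expected": "spam", "tag": "adversarial"},
    {"input": "", "expected": "other", "tag": "edge_case"},
]


def save_test_set(examples: list[dict], path: str) -> None:
    """Write one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


def load_test_set(path: str) -> list[dict]:
    """Read a JSONL file, skipping blank lines and rejecting incomplete examples."""
    examples = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            example = json.loads(line)
            missing = REQUIRED_KEYS - example.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            examples.append(example)
    return examples


def coverage_by_tag(examples: list[dict]) -> Counter:
    """Count examples per tag to spot a test set that only covers happy paths."""
    return Counter(example.get("tag", "untagged") for example in examples)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = str(Path(tmp) / "test_set.jsonl")
        save_test_set(EXAMPLES, path)
        loaded = load_test_set(path)
        print(len(loaded), dict(coverage_by_tag(loaded)))
```

The tag counts are a quick way to see whether edge cases and adversarial inputs are represented before any model is run against the set.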
