The Best Agent Evals Come From Production Failures, Not Design Sessions

I shipped my first production agent a few years back. Before it went live, I spent two weeks designing what I was confident was a thorough set of agent evals. Fifty test cases. Graded on task completion, tool call accuracy, and response quality. I was proud of it.

Three days after launch, a user submitted a request unlike anything in my test cases. The agent looped. It called the same tool four times, got confused by the third response, and produced output that was confidently wrong.

None of my carefully designed evals had caught it. Because I had written evals for what I thought users would do. Not for what they actually did.

That is the central tension in agent evaluation. You write tests before you have real data. Then the data shows you everything you missed.

For a visual guide to all 9 concepts, see the Agent Evals infographic.

The blank canvas problem

One of the biggest reasons teams skip agent evals entirely is not laziness. It is the blank canvas problem. You sit down to write your first eval and you do not know where to start. What should I test? How many cases do I need? What counts as passing?

The problem compounds the more you think about it. Agents are non-deterministic. The same input can produce different outputs on different runs. The task space for most real-world agents is enormous. You cannot enumerate it fully, and any attempt to do so before shipping feels like guesswork.

So teams write a few test cases, feel unconvinced they are comprehensive enough, and move on to building features instead. The eval suite never happens.

The reframe that helped me most: stop trying to design a comprehensive eval suite. Start with unit tests.

Not unit tests for your code. Unit tests for your agent's behavior. Pick the three behaviors that matter most to your product. Write one test case for each. Get a number. That number, even from three cases, is far more useful than having no number at all.

Small, specific evals build momentum. They give you a real metric. And they create a foundation you can grow from as production traces accumulate.

What evals actually are and are not

An eval, at its simplest, is this: give the agent an input, apply some grading logic to its output, measure success.

That is it. An eval is a test.
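As a minimal sketch of that definition, the loop below feeds each input to an agent, applies grading logic to the output, and reports a pass rate. The names `EvalCase`, `run_suite`, and `toy_agent` are illustrative assumptions, and the stand-in agent exists only so the example runs without a model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    user_input: str
    grade: Callable[[str], bool]  # grading logic: True means the output passes


def run_suite(agent: Callable[[str], str], cases: list) -> float:
    """Run every case once and return the fraction that passed."""
    if not cases:
        raise ValueError("an eval suite needs at least one case")
    passed = sum(1 for case in cases if case.grade(agent(case.user_input)))
    return passed / len(cases)


# Hypothetical stand-in for a real agent call.
def toy_agent(user_input: str) -> str:
    return "Refund issued." if "refund" in user_input.lower() else "I am not sure."


cases = [
    EvalCase("refund request", "Please refund order 123", lambda out: "refund" in out.lower()),
    EvalCase("greeting", "Hello there", lambda out: "hello" in out.lower()),
]
pass_rate = run_suite(toy_agent, cases)
print(f"pass rate: {pass_rate:.0%}")  # prints "pass rate: 50%"
```

Swapping `toy_agent` for a function that calls your real agent is the only change needed to get a first number.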

What makes agent evals different from standard unit tests is that the grading logic is often itself a model. You cannot write a regex that checks whether an agent's response "correctly understood the user's intent and took the right action." So you use an LLM as the judge. You define a rubric, you run the output through the judge model, and you get a score.

This is powerful. It is also where many teams go wrong. LLM-as-judge is not ground truth. It is an approximation. The judge model has its own biases and its own failure modes. Your eval results are only as reliable as your judge and your rubric.
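One way to keep a judge honest is to make the rubric explicit and to treat malformed judge replies as failures. The sketch below assumes a `call_judge` function that you supply (it takes a prompt string and returns the judge model's text reply); the rubric wording, the `SCORE:` format, and the passing threshold are all illustrative choices, not a standard.

```python
import re
from typing import Callable, Optional

# Illustrative rubric; a real one should be specific to your product.
RUBRIC = (
    "Score the response from 1 to 5.\n"
    "5: understood the user's intent and took the right action.\n"
    "1: misunderstood the intent or took a wrong action.\n"
    "Reply with a line of the form 'SCORE: <n>'."
)


def build_judge_prompt(user_input: str, agent_output: str) -> str:
    return f"{RUBRIC}\n\nUser request:\n{user_input}\n\nAgent response:\n{agent_output}"


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract the 1-5 score, or None if the judge ignored the format."""
    match = re.search(r"SCORE:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def judge_case(call_judge: Callable[[str], str], user_input: str,
               agent_output: str, passing_score: int = 4) -> bool:
    score = parse_score(call_judge(build_judge_prompt(user_input, agent_output)))
    # An unparseable reply counts as a failure instead of silently passing.
    return score is not None and score >= passing_score


# Fake judge so the sketch runs offline; replace with a real model call.
def fake_judge(prompt: str) -> str:
    return "The agent took the right action.\nSCORE: 5"


print(judge_case(fake_judge, "Cancel my order", "Order cancelled."))  # prints True
```

Because the judge is passed in as a plain function, you can test the rubric plumbing without spending tokens, and swap judge models without touching the eval cases.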

What evals are not: a one-time certification that your agent is ready. They are a continuous measurement practice. You run them before you ship a change. You run them after. You run them when you swap the model, when you change the prompt, when you update tools. Any time something in the system changes, you run the suite and check whether the numbers held.

The other thing evals are not is optional. Agents operating without evals are being changed and shipped on instinct alone. You might fix something in Tuesday's prompt update and silently break something that was working last week. You will not know until a user tells you.

Evals are the operating system for agent quality. You can build without them. You cannot build reliably without them.

Why the best evals come from traces

I have never written a genuinely good eval without first seeing the failure in production.

That is a strong claim. Here is what it means in practice.

When you design evals before shipping, you are making bets about what users will do and what will break. Those bets are informed by your experience and your product knowledge. They are not uninformed. But they are incomplete in ways that are hard to anticipate because you have not seen real usage yet.

Real users surprise you. They phrase requests in ways you did not model. They combine features in unexpected sequences. They bring context you did not know about. The failures that result from those surprises are exactly the edge cases your agent needs to handle.

When a failure happens in production, you have something genuinely valuable: a trace. A record of what the user sent, what the agent did at each step, which tools it called, what those tools returned, and what the final output was. That trace is the raw material for a precise eval case.

The workflow becomes:

1. Agent ships with a small initial eval suite
2. You instrument traces in production
3. A user hits a failure case
4. You pull the trace, understand why the failure happened
5. You write an eval case from that trace
6. You fix the agent behavior
7. You run the eval suite to confirm the fix
8. The new case stays in the suite permanently

That last step matters. Once you write an eval from a real failure, it does not leave. You have made a permanent non-regression commitment. That class of failure will not return undetected.
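To make the trace-to-case step concrete, here is a rough sketch using the looping failure from the introduction. The `Trace` schema, the field names, and the sample data are all made up for illustration; real tracing tools use their own formats, and the check you derive will depend on the failure you saw.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical trace record; real tracing tools define their own schema."""
    trace_id: str
    user_input: str
    tool_calls: list  # each item: {"tool": str, "args": dict, "result": str}
    final_output: str


@dataclass
class TraceEvalCase:
    source_trace_id: str       # keeps the link back to the production failure
    user_input: str            # replayed verbatim, not paraphrased
    recorded_tool_calls: list  # tool results captured in production
    max_calls_per_tool: int    # the behavior this case protects


def case_from_trace(trace: Trace, max_calls_per_tool: int = 2) -> TraceEvalCase:
    return TraceEvalCase(trace.trace_id, trace.user_input,
                         list(trace.tool_calls), max_calls_per_tool)


def violates_loop_limit(tool_calls: list, max_calls_per_tool: int) -> bool:
    """True if any single tool was called more often than the case allows."""
    counts = Counter(call["tool"] for call in tool_calls)
    return any(n > max_calls_per_tool for n in counts.values())


# Made-up failing trace: the same tool called four times.
failing = Trace(
    "trace-001",
    "Find my last invoice",
    [{"tool": "search_invoices", "args": {"q": "last"}, "result": "timeout"}] * 4,
    "Your invoice total is $0.",
)
case = case_from_trace(failing)
print(violates_loop_limit(failing.tool_calls, case.max_calls_per_tool))  # prints True
```

After the fix, you would rerun the agent on `case.user_input` and apply the same check to the new run's tool calls; the case then stays in the suite.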

This is how your eval suite grows from 3 cases to 30 to 300. Not from design sessions. From production.

Agent = fit(model, evals)

This is a framing I want to be precise about, because it changes how you think about what evals are for.

In a rough sense: Agent = fit(model, evals).

Your agent is the product of the model you chose, shaped by the evals you optimized against. Good evals produce a good agent. Bad evals, or no evals, produce an agent that may perform well in your test environment and fail in nearly every real-world condition you did not anticipate.

The implication is uncomfortable but useful: evals are not just tests. They are the specification of what you want the agent to be. They map out the territory you expect the agent to navigate. The agent comes to fit that territory because every prompt, tool, and model change is measured against it.

If you only write evals for happy paths, you are specifying an agent that handles happy paths. If you write evals that include edge cases, ambiguous inputs, and failure recoveries, you are specifying a more robust agent.

This also reframes capability evals. Some evals will fail today not because your agent engineering is bad, but because the task is genuinely too hard for current models. That is not a reason to remove the eval. It is a flag on the map that says "this is where the next generation of models or the next round of agent engineering needs to take us." Aspirational evals are legitimate evals. You run them, note the pass rate, and keep improving toward them.

The sim2real gap kills your numbers

The sim2real gap is a concept from robotics. It describes the difference between how a robot performs in simulation and how it performs in the real world. The bigger the gap, the less useful your simulation results are for predicting real-world behavior.

Agent evals have the same problem.

Your eval environment and your production environment are never identical. The question is how large the gap is and how much it distorts your numbers.

The gap appears in predictable places. Your eval might use a fixed set of tool responses, but production tools return live data that changes. Your eval might run against one model version, but production has quietly rolled to a newer one. Your eval environment might not replicate the context window depth your production agent operates with after 30 tool calls. Your eval might not simulate the latency and retry behavior that affects agent decisions in the real system.

Every difference between your eval environment and your production environment is a source of measurement error. A pass rate that looks healthy in your eval suite can mask real fragility if the environment drifts far enough from production.

The practical fix: mirror production as closely as you can when setting up your eval environment. Same model version. Same tool configurations. Same context structure. Where you cannot use live tools, use recorded responses from actual production runs rather than hand-crafted mocks. The closer your eval environment is to production, the more you can trust the numbers.
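A small sketch of the recorded-responses idea: a replay layer that answers tool calls from production recordings and fails loudly on anything it has not seen. The `ReplayTool` class and the recording format are assumptions made for this example, not part of any particular framework.

```python
import json


class ReplayTool:
    """Serves tool results recorded in production instead of calling live tools."""

    def __init__(self, recordings: list):
        # Key on tool name plus canonicalized arguments so argument order does not matter.
        self._by_call = {
            self._key(r["tool"], r["args"]): r["result"] for r in recordings
        }

    @staticmethod
    def _key(tool: str, args: dict) -> str:
        return tool + ":" + json.dumps(args, sort_keys=True)

    def call(self, tool: str, args: dict) -> str:
        key = self._key(tool, args)
        if key not in self._by_call:
            # An unrecorded call means the agent took a path the trace never covered.
            raise KeyError(f"no recorded response for {key}")
        return self._by_call[key]


# Made-up recording for illustration.
recordings = [
    {"tool": "get_weather", "args": {"city": "Oslo", "units": "c"}, "result": "4C, rain"},
]
replay = ReplayTool(recordings)
print(replay.call("get_weather", {"units": "c", "city": "Oslo"}))  # prints "4C, rain"
```

Raising on unrecorded calls is a deliberate choice here: returning a default value would hide exactly the kind of drift between eval and production that this section warns about.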

The sim2real gap is a real cost. It compounds with every shortcut you take.

Evals as regression tests

This is the frame that makes agent evals click for engineers who already think in tests.

Evals are regression tests for agent behavior.

When you change a prompt, you might fix the failure you observed last week. You might also break behavior that was working fine the week before. Without a regression suite running against every change, you have no way to know until users tell you.

The workflow mirrors what good engineering teams already do with code:

- When you find a bug, write a test
- Run the full test suite before merging
- Keep the suite green as a condition of shipping

With agents, it looks like:

- When you find a failure in production, write an eval case from the trace
- Run the full eval suite before any prompt or model change
- Track pass rate over time as your signal of agent health
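The agent-side steps above can be sketched as a simple gate that compares a run against a stored baseline. The function names and the case names are hypothetical, and the results are invented to show one specific hazard.

```python
def pass_rate(results: dict) -> float:
    """Fraction of cases that passed; results maps case name to True/False."""
    return sum(results.values()) / len(results)


def regressions(baseline: dict, current: dict) -> list:
    """Cases that passed at baseline but fail, or are missing, in the current run."""
    return sorted(
        name for name, ok in baseline.items() if ok and not current.get(name, False)
    )


# Invented results from before and after a prompt change.
baseline = {"refund_flow": True, "address_change": True, "multi_step_booking": False}
current = {"refund_flow": True, "address_change": False, "multi_step_booking": True}

broken = regressions(baseline, current)
print(f"pass rate {pass_rate(baseline):.0%} -> {pass_rate(current):.0%}, regressions: {broken}")
# prints "pass rate 67% -> 67%, regressions: ['address_change']"
```

In this invented run the aggregate pass rate did not move, yet one previously passing case broke. Tracking per-case results alongside the headline number is what surfaces that.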

The difference from code tests is the pass rate target. For code, you want 100%. For regression evals on agents, close to 100% is the right target, because non-deterministic outputs and judge variance mean an occasional miss can be noise rather than a bug. For capability evals (the aspirational cases testing harder tasks), a lower pass rate is expected and acceptable.

The graduation rule matters: once an agent reliably passes a capability eval, it moves into the regression suite. Now it is protected. It cannot silently regress without you knowing.
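What "reliably passes" means is a judgment call. One possible rule, sketched below, looks at the most recent runs of a capability case and graduates it once the pass rate clears a threshold. The defaults of 10 runs and 90% are arbitrary assumptions to tune for your own suite.

```python
def should_graduate(run_history: list, min_runs: int = 10,
                    min_pass_rate: float = 0.9) -> bool:
    """Decide whether a capability case is reliable enough to become a regression case.

    run_history holds one True/False per past run, oldest first.
    """
    recent = run_history[-min_runs:]
    if len(recent) < min_runs:
        return False  # not enough evidence yet
    return sum(recent) / len(recent) >= min_pass_rate


# A case that failed early on but has passed its last ten runs.
history = [False] * 6 + [True] * 10
print(should_graduate(history))  # prints True
```

Looking only at recent runs lets a case graduate after the agent improves, without old failures holding it back.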

Starting small: the unit test approach

If you are looking at a blank eval canvas right now, here is the smallest step that is worth taking.

Pick one behavior your agent must always get right. Not a full flow. One behavior. Write a test case for it. Define what passing looks like. Run it.

Now you have an eval suite. It has one case. That is fine.

Next week, check your traces. Did anything fail or surprise you? Turn it into a second case.

The suite grows one case at a time. The barrier to starting is low if you let go of the idea that your first eval suite needs to be comprehensive. It just needs to exist.

The best eval is an eval that actually exists.

That sounds obvious. I have watched teams spend three weeks debating eval design and ship with zero evals because the suite never felt complete enough to use. Meanwhile, teams that started with five rough cases and grew from traces had suites in the dozens within a month.

Done beats perfect here. A rough case today is better than a perfect case never.

Spring cleaning your eval suite

Evals are not permanent fixtures. They go stale.

Models get smarter. A capability eval that was a meaningful test six months ago might pass trivially today, which means it is adding cost and noise but no signal. An eval written for a product feature that no longer exists is pure waste. An eval that tests user behavior patterns that have since shifted is pushing your agent toward something you no longer care about.

The eval suite needs the same maintenance discipline as your codebase. Dead evals get removed. Outdated evals get updated or replaced. New product priorities get new evals.

In practice: a quarterly review. Go through the suite and ask of each case: is this still testing something that matters? Is the pass/fail threshold still meaningful given what current models can do? Does this test still reflect real user behavior?
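Part of that review can be pre-screened automatically. The sketch below flags cases that deserve a human look; it assumes you track a few pieces of metadata per case (`CaseStats` and its fields are hypothetical), and the 20-run and 180-day cutoffs are placeholders.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CaseStats:
    name: str
    kind: str                  # "capability" or "regression"
    recent_passes: int
    recent_runs: int
    feature_active: bool       # does the product feature still exist?
    last_seen_in_traces: date  # last time similar user behavior showed up


def review_flags(stats: CaseStats, today: date, stale_after_days: int = 180) -> list:
    """Return reasons this case should be looked at during the review."""
    flags = []
    if not stats.feature_active:
        flags.append("feature removed")
    # Only capability cases are flagged for always passing;
    # a regression case that always passes is doing its job.
    if (stats.kind == "capability" and stats.recent_runs >= 20
            and stats.recent_passes == stats.recent_runs):
        flags.append("always passes: graduate or retire")
    if today - stats.last_seen_in_traces > timedelta(days=stale_after_days):
        flags.append("behavior not seen recently")
    return flags


old_case = CaseStats("legacy_export", "capability", 20, 20, False, date(2024, 1, 10))
print(review_flags(old_case, today=date(2025, 1, 10)))
```

The flags are prompts for a decision, not automatic deletions; a person still decides whether each case is removed, updated, or kept.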

Evals you remove are not wasted work. They did their job. Keeping them past their useful life is the waste, because they cost money to run and can degrade the quality of your eval signal by testing things you do not care about anymore.

Spring clean the suite. Keep it sharp.

How to get started today

If you are shipping an agent and your current eval count is zero:

1. Pick three behaviors your agent must get right. Write one test case each.
2. Define grading logic for each. A simple rubric scored by an LLM judge is enough to start.
3. Run them. Record the baseline pass rate.
4. Every time you see a failure in production, add a case from the trace.
5. Run the full suite before any prompt change, tool update, or model swap.
6. Review and prune quarterly.

That is the system. It is not complicated. The hard part is the discipline of maintaining it rather than treating it as a one-time setup.

The payoff is concrete: when you swap models, you have a number before and after. When you change the prompt, you have a number. When someone asks whether the agent is better than it was last quarter, you have an answer that is not just intuition.

Without agent evals, you are flying the system on feel. With them, you are flying with instruments.

FAQ

What is the difference between a capability eval and a regression eval?

A capability eval tests something the agent currently struggles with. It starts at a low pass rate and gives your team a hill to climb. A regression eval tests something the agent already handles reliably. It should pass at close to 100% and its job is to catch backsliding. Once an agent reliably passes a capability eval, it graduates into the regression suite.

How many evals do I need before shipping?

More than zero. Three to five focused cases covering your agent's core behaviors is a reasonable floor. The suite grows from there as production traces accumulate. Do not wait for comprehensiveness before shipping. A rough suite that exists beats a perfect suite that is still being planned.

Should I write evals before or after building the agent?

Both. A small set before building defines what done looks like and gives you an early feedback signal. But the most valuable evals come after you have production traces to learn from. Treat your initial evals as a starting point, not a complete specification.

How do I handle non-deterministic agent behavior in evals?

Run each eval case multiple times (3 to 5 runs) and look at pass rate rather than binary pass/fail. An agent that passes 4 out of 5 runs on a difficult case gives you more useful information than a single result. LLM-as-judge grading also introduces variance, so running multiple judge passes and averaging the score reduces noise.
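A minimal sketch of both ideas, assuming you can wrap a single agent run and a single judge call as zero-argument functions (`run_case` and `score_once` are placeholder names). The scripted outcomes below stand in for real runs so the example is repeatable.

```python
from typing import Callable


def pass_rate_over_runs(run_case: Callable[[], bool], n_runs: int = 5) -> float:
    """Run the same case n_runs times and return the fraction of passes."""
    return sum(1 for _ in range(n_runs) if run_case()) / n_runs


def averaged_judge_score(score_once: Callable[[], float], n_passes: int = 3) -> float:
    """Ask the judge n_passes times and average the scores to reduce noise."""
    scores = [score_once() for _ in range(n_passes)]
    return sum(scores) / len(scores)


# Scripted outcomes in place of real agent runs and judge calls.
outcomes = iter([True, True, False, True, True])
scores = iter([4, 5, 3])

print(pass_rate_over_runs(lambda: next(outcomes)))   # prints 0.8
print(averaged_judge_score(lambda: next(scores)))    # prints 4.0
```

Each extra run or judge pass multiplies cost, so the run counts are a trade-off between noise and spend.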

What is the sim2real gap and why does it matter for agent evals?

The sim2real gap is the difference between your eval environment and your production environment. The bigger the gap, the less you can trust your eval numbers. Minimize it by using the same model version, the same tool configurations, the same context structure, and recorded production tool responses rather than hand-crafted mocks.

When should I remove an eval from the suite?

When it tests behavior your users no longer exhibit, when it tests a feature that no longer exists, or when it passes so trivially that it adds no real signal. Do a quarterly review. Evals past their useful life cost money to run and can push your agent toward behaviors you no longer care about.

What is LLM-as-judge and is it reliable enough to use?

LLM-as-judge means using a language model to score your agent's output against a defined rubric. It is the most practical way to evaluate open-ended agent responses at scale. It is not ground truth. The judge model has its own biases. Mitigate this by writing precise rubrics, using a stronger model as judge than the one being evaluated, and running multiple judge passes. Treat the scores as approximate, directional signals, not absolute measurements.