When we shipped our first agentic feature into production in 2025, we thought we'd been thorough. We had unit tests. We had a manual QA pass. We even had a small bench of 30 'representative' prompts our PM had hand-picked. Two weeks later, support tickets started piling up — the agent was confidently citing documents that didn't exist.
Eight months and four production agents later, we've learned that traditional testing doesn't catch the failure modes that matter. The agent isn't broken in a way that crashes. It's broken in a way that sounds correct.
The three failure modes we kept seeing
- Tool selection drift — picking the right tool 90% of the time, then silently picking a wrong-but-plausible one in the long tail.
- Citation fabrication — quoting a 'source document' that the retrieval step never returned.
- Premature termination — declaring the task complete after one tool call when three were needed.
None of these are caught by string-match assertions. They live in the shape of the trace, not in the final response. So we rebuilt our eval pipeline around traces.
What an eval harness actually needs
An agent eval harness is not a test suite. It's a graded simulator. For each scenario in our golden set, we capture: the input, the full tool-call trace, the citations returned, and the final response. Then we score each layer independently.
- Retrieval score — did the right documents come back in the top-K?
- Groundedness score — does every factual claim trace back to a retrieved document?
- Tool-trajectory score — did the agent use the expected tools in a reasonable order?
- Outcome score — did the user-facing answer actually solve the task?
We run this against 200 golden scenarios on every PR that touches the agent. A regression on any axis blocks the merge. The eval set itself is versioned, reviewed, and expanded every time we find a new failure mode in production.
What we'd do differently
If we were starting over today, we'd build the eval harness before the first agent feature shipped — not after. The eval set is the artifact that lets you iterate on prompts and models with confidence. Without it, every prompt change is a guess.
You don't need a model evaluator. You need a system evaluator — one that scores the whole pipeline, not just the LLM output.
— Internal post-mortem, Q1 2026
We'll open-source a stripped-down version of the harness later this year. In the meantime, if you're building agentic features and seeing the same patterns, the playbook is: trace everything, score each layer, and treat the golden set like production code.