RAG systems are easy to demo and hard to ship. The reason: they have three separate failure surfaces (retrieval, ranking, generation), and a failure in an upstream stage silently degrades everything downstream of it. You need an eval that scores each layer, not just the final response.
The 200-question golden set
We seeded the set with real user queries scrubbed of PII, then expanded with synthetic variants — paraphrases, edge cases, adversarial framings. Each question is annotated with the document IDs that should appear in the top-5, a reference answer, and the set of factual claims the answer must contain.
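To make the annotation concrete, here is a minimal sketch of what one golden-set record could look like. The class name, field names, and the example values are all hypothetical and chosen for illustration; only the three annotations (expected document IDs, reference answer, required claims) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenQuestion:
    """One annotated entry in the golden set (all names here are illustrative)."""
    question_id: str
    question: str
    expected_doc_ids: frozenset[str]   # documents that should appear in the top-5
    reference_answer: str
    required_claims: tuple[str, ...]   # factual claims the answer must contain
    source: str = "real"               # assumed tag: "real" (scrubbed query) or "synthetic"


# Made-up example record, not taken from any real golden set.
example = GoldenQuestion(
    question_id="q-001",
    question="How do I rotate an API key?",
    expected_doc_ids=frozenset({"doc-17", "doc-42"}),
    reference_answer="Create a new key in settings, then revoke the old one.",
    required_claims=(
        "a new key is created in settings",
        "the old key is revoked",
    ),
)
```

Tagging each record as real or synthetic is an assumption on our part, but it makes it possible to report metrics separately for the two groups.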
Scoring per layer
- Retrieval recall@5: for each question, the fraction of expected documents that appear in the top-5, averaged over the set.
- Reranker MRR: the reciprocal rank of the first relevant document in the reranked list, averaged over all questions.
- Groundedness: the fraction of factual claims in the response that map to a retrieved document.
- Answer accuracy: judged by a held-out LLM against the reference answer, with a human spot-check on disagreements.
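The first three metrics are simple enough to sketch directly. The functions below are an illustration under stated assumptions, not our production code: the function names are hypothetical, the handling of empty inputs is a convention we picked for the sketch, and groundedness assumes some upstream step has already decided, per claim, whether it maps to a retrieved document.

```python
def recall_at_k(expected: set[str], retrieved: list[str], k: int = 5) -> float:
    """Fraction of expected documents that appear in the top-k retrieved list."""
    if not expected:
        return 1.0  # assumption: nothing expected counts as fully recalled
    return len(expected & set(retrieved[:k])) / len(expected)


def reciprocal_rank(expected: set[str], ranked: list[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is present."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(pairs: list[tuple[set[str], list[str]]]) -> float:
    """Average reciprocal rank over (expected, ranked) pairs, one per question."""
    if not pairs:
        return 0.0
    return sum(reciprocal_rank(exp, ranked) for exp, ranked in pairs) / len(pairs)


def groundedness(claim_supported: list[bool]) -> float:
    """Fraction of response claims that map to a retrieved document.

    Each entry says whether one claim was matched to a retrieved document;
    how that match is decided is outside this sketch.
    """
    if not claim_supported:
        return 1.0  # assumption: a response with no claims has nothing ungrounded
    return sum(claim_supported) / len(claim_supported)
```

Answer accuracy is left out on purpose: it depends on the judge model and prompt, which cannot be reduced to a few lines.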
The CI pipeline
Every PR that touches the embedding model, the chunking config, the retriever, or the generation prompt runs the full eval. We post the diff against main into the PR — green if all four metrics held or improved, red on regression. The PR can't merge if red, no matter how clean the unit tests are.
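The green/red decision can be sketched as a comparison of two metric dictionaries. The metric keys, function names, and the `tolerance` parameter below are assumptions for illustration; the rule described above corresponds to a tolerance of zero, and a nonzero value is something you might consider only if the LLM-judged metric turns out to be noisy between runs.

```python
METRICS = ("recall_at_5", "reranker_mrr", "groundedness", "answer_accuracy")


def metric_deltas(main: dict[str, float], pr: dict[str, float]) -> dict[str, float]:
    """Per-metric change of the PR run relative to main (positive is better)."""
    return {name: pr[name] - main[name] for name in METRICS}


def regressions(main: dict[str, float], pr: dict[str, float],
                tolerance: float = 0.0) -> list[str]:
    """Names of metrics that dropped by more than `tolerance`."""
    return [name for name, delta in metric_deltas(main, pr).items()
            if delta < -tolerance]


def format_report(main: dict[str, float], pr: dict[str, float],
                  tolerance: float = 0.0) -> str:
    """Plain-text diff suitable for posting as a PR comment."""
    bad = set(regressions(main, pr, tolerance))
    lines = []
    for name, delta in metric_deltas(main, pr).items():
        status = "RED" if name in bad else "GREEN"
        lines.append(f"{status:5} {name}: {main[name]:.3f} -> {pr[name]:.3f} ({delta:+.3f})")
    return "\n".join(lines)


def exit_code(main: dict[str, float], pr: dict[str, float],
              tolerance: float = 0.0) -> int:
    """0 if every metric held or improved, 1 if any metric regressed."""
    return 1 if regressions(main, pr, tolerance) else 0
```

A CI step would typically print `format_report(...)` and then call `sys.exit(exit_code(...))`, so that a regression fails the required check and blocks the merge.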
Two practical notes. First, the eval takes 8 minutes — keep it under 10 or developers will route around it. Second, version the golden set: a regression caused by a question rewrite is different from a regression caused by code, and you need to tell them apart.
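One way to version the golden set is to store a content fingerprint alongside every eval run. The sketch below is an assumption about how this could be done, not a description of our setup: the function name and the `question_id` key are hypothetical, and it expects each question to be a JSON-serializable dict (lists rather than sets).

```python
import hashlib
import json


def golden_set_fingerprint(questions: list[dict]) -> str:
    """SHA-256 over the golden set content, stable across question and key order."""
    canonical = json.dumps(
        sorted(questions, key=lambda q: q["question_id"]),  # ignore file order
        sort_keys=True,                                     # ignore key order
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

If the fingerprint recorded for the PR run differs from the one recorded for main, the metric diff mixes a data change with a code change, and the two should be evaluated separately before reading anything into the numbers.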