Building a RAG eval framework with 200 golden questions

RAG systems are easy to demo and hard to ship. The reason: they have three distinct failure surfaces (retrieval, ranking, generation), and a failure in any one silently degrades every stage after it. You need an eval that scores each layer, not just the final response.

The 200-question golden set

We seeded the set with real user queries scrubbed of PII, then expanded with synthetic variants — paraphrases, edge cases, adversarial framings. Each question is annotated with the document IDs that should appear in the top-5, a reference answer, and the set of factual claims the answer must contain.
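To make the annotation concrete, here is a minimal sketch of what one golden-set entry could look like. The class name, field names, and the `validate` helper are illustrative assumptions, not the schema we actually use. The only parts taken from the description above are the three annotations: expected document IDs, a reference answer, and required claims.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenQuestion:
    """One golden-set entry. All field names here are hypothetical."""
    question_id: str
    question: str
    expected_doc_ids: frozenset[str]   # documents that should appear in the top-5
    reference_answer: str
    required_claims: tuple[str, ...]   # factual claims the answer must contain
    source: str = "user"               # "user" (scrubbed real query) or "synthetic"


def validate(q: GoldenQuestion) -> list[str]:
    """Return annotation problems; an empty list means the entry is usable."""
    problems = []
    if not q.question.strip():
        problems.append("empty question")
    if not q.expected_doc_ids:
        problems.append("no expected documents")
    if len(q.expected_doc_ids) > 5:
        problems.append("more than 5 expected documents cannot all fit in the top-5")
    if not q.reference_answer.strip():
        problems.append("empty reference answer")
    if not q.required_claims:
        problems.append("no required claims")
    return problems


# Made-up example entry, for illustration only.
example = GoldenQuestion(
    question_id="q-001",
    question="How do I rotate an API key?",
    expected_doc_ids=frozenset({"doc-12", "doc-40"}),
    reference_answer="Create a new key in settings, then revoke the old one.",
    required_claims=("a new key is created in settings", "the old key is revoked"),
)
```

Running a check like `validate` over the whole set before each eval catches entries that can never score well, such as a question with six expected documents and a top-5 cutoff.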

Scoring per layer

Retrieval recall@5 — fraction of expected documents that appear in the top-5.
Reranker MRR — mean reciprocal rank of the first relevant document.
Groundedness — fraction of factual claims in the response that map to a retrieved document.
Answer accuracy — judged by a held-out LLM against the reference answer, with a human spot-check on disagreements.
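
The first three metrics are simple enough to sketch directly. The function names and input shapes below are assumptions for illustration. In particular, `groundedness` assumes a separate, unshown step has already extracted the claims from the response and mapped each one to a supporting document ID, or to `None` when no retrieved document supports it. Answer accuracy is left out because it depends on the judge model.

```python
from typing import Optional


def recall_at_k(retrieved: list[str], expected: set[str], k: int = 5) -> float:
    """Fraction of expected documents that appear in the top-k retrieved IDs."""
    if not expected:
        raise ValueError("expected must contain at least one document ID")
    return len(set(retrieved[:k]) & expected) / len(expected)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is present."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("runs must not be empty")
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


def groundedness(claim_support: dict[str, Optional[str]]) -> float:
    """Fraction of response claims that map to a retrieved document.

    claim_support maps each claim to a supporting document ID, or None.
    Returning 1.0 for a response with no claims is a choice made for this
    sketch: such a response asserts nothing unsupported.
    """
    if not claim_support:
        return 1.0
    supported = sum(1 for doc_id in claim_support.values() if doc_id is not None)
    return supported / len(claim_support)
```

For example, `recall_at_k(["a", "b", "c", "d", "e", "f"], {"a", "f"})` is 0.5, because `"f"` sits at rank 6 and falls outside the top-5.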

The CI pipeline

Every PR that touches the embedding model, the chunking config, the retriever, or the generation prompt runs the full eval. We post the diff against main into the PR — green if all four metrics held or improved, red on regression. The PR can't merge if red, no matter how clean the unit tests are.
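
The merge gate itself can be small. The sketch below assumes the eval for each branch produces a dictionary with one score per metric. The metric keys and the `compare_to_main` name are hypothetical, and how the scores reach this function (artifact, cache, rerun on main) is left open.

```python
# Hypothetical metric keys, one per layer described above.
METRICS = ("recall_at_5", "reranker_mrr", "groundedness", "answer_accuracy")


def compare_to_main(main: dict[str, float], pr: dict[str, float]) -> tuple[str, list[str], str]:
    """Compare PR scores against main.

    Returns (verdict, regressed_metric_names, report_text). The verdict is
    "green" only if every metric held or improved, otherwise "red".
    """
    lines = []
    regressed = []
    for name in METRICS:
        delta = pr[name] - main[name]
        if delta < 0:
            regressed.append(name)
        status = "REGRESSION" if delta < 0 else "ok"
        lines.append(
            f"{name:<16} main={main[name]:.3f} pr={pr[name]:.3f} delta={delta:+.3f} {status}"
        )
    verdict = "red" if regressed else "green"
    return verdict, regressed, "\n".join(lines)
```

In a CI job, the report text would be posted as the PR comment and the process would exit nonzero when the verdict is `"red"`, which is what blocks the merge.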

Two practical notes. First, the eval takes 8 minutes — keep it under 10 or developers will route around it. Second, version the golden set: a regression caused by a question rewrite is different from a regression caused by code, and you need to tell them apart.
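
One way to version the golden set is to derive a fingerprint from its content and store it with every eval result. This is a sketch under assumptions: questions are plain dictionaries with a `question_id` key, and the function names and the `golden_set_version` field are hypothetical. An explicit version number checked into the repo would work just as well.

```python
import hashlib
import json


def golden_set_version(questions: list[dict]) -> str:
    """Short content hash of the golden set, independent of question order."""
    ordered = sorted(questions, key=lambda q: q["question_id"])
    canonical = json.dumps(ordered, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def same_golden_set(main_run: dict, pr_run: dict) -> bool:
    """True if both eval runs were scored against the same golden set.

    Each run is assumed to carry the fingerprint under "golden_set_version".
    """
    return main_run["golden_set_version"] == pr_run["golden_set_version"]
```

When `same_golden_set` is false, the score diff mixes question changes with code changes, so the PR comment should say so rather than report a plain regression.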
