From Toy to Truth: Building Production-Grade RAG

Prefer to watch or listen? ▶ YouTube ♫ Spotify ✈ Telegram

A RAG demo is the easiest thing in AI to build and the hardest thing to trust. You chunk a few PDFs, drop them in a vector store, wire up top_k=5, and the answers look great — on the 100 documents you tested. Then you point it at the real corpus and it starts confidently citing the wrong paragraph, or three copies of the same paragraph, and you have no idea why.

The gap between the toy and the truth isn't a better model. It's measurement and ranking — two things almost nobody adds until production embarrasses them. This is the part of RAG that doesn't fit in a quickstart.

The silent bottleneck: redundancy

Start with the failure that hides best. The default RAG pipeline sorts candidates by vector similarity and stops there. That feels right — return the closest matches — but "closest" and "most useful" are different objectives. When five chunks all say nearly the same thing, similarity sort happily returns all five.

Now your context window is full and your LLM has seen exactly one fact, repeated. It can't synthesize what it was never given. The answer comes back thin or wrong, and the retrieval scores look fine, because each individual chunk really was relevant. Redundancy is the bug that passes every per-chunk check. You can't fix what you don't measure, so measurement comes first.

Layer 1 and Layer 2: measure what you find, then what you write

The single most useful move in production RAG is to split evaluation into two questions that fail independently:

Retrieval — did we fetch the right context? This is a search problem, judged before the LLM ever runs.
Generation — did the model use that context correctly? This is a grounding problem, judged on the final answer.

Conflating them is why teams stay stuck. A bad answer could mean retrieval handed over garbage, or retrieval was perfect and the model ignored it. Those have opposite fixes. Separate them and every failure points at its own cause.

The retrieval metrics that matter

Context Precision — is the genuinely useful context near the top of the list? Rank matters; an LLM weights early context more.
Context Recall — did you retrieve all the information the answer needs? Miss one required chunk and the model is guessing.

The diagnostic that earns its keep:

High precision + low recall = your chunks are accurate but incomplete. You're returning correct fragments and missing the bigger picture. The fix is mechanical: larger chunks, or a higher top_k. High recall + low precision = you're pulling the right info buried in noise — that's a ranking problem, which the funnel below solves.

The generation metric that matters

Faithfulness — is every claim in the answer grounded in the retrieved context, or did the model improvise? This is your hallucination gate. An unfaithful answer is a liability even when it happens to be correct, because next time it won't be.

Round it out with answer relevance and a few others and you land at roughly ten metrics — but precision, recall, and faithfulness are the three that catch most production fires.

The ranking funnel: three stages, three jobs

Once you accept that "sort by cosine similarity" is the bug, the fix is an industry-standard funnel. Each stage trades volume for precision.

1. Retrieval — cast a wide net. Fast bi-encoders (your vector search) reduce millions of documents to a high-recall candidate set of ~1,000. The job here is recall: don't lose the right answer. Speed over precision.

2. Scoring — judge each candidate properly. A cross-encoder reranker reads the query and each candidate together and assigns a real relevance score. Bi-encoders embed query and document separately and hope the geometry lines up; cross-encoders actually compare them. Pointwise: "how good is this one item?" Slow, so you only run it on the ~1,000 survivors, not the millions.

3. Ordering — pick the best set. The stage everyone skips. Now ask a listwise question: "how good is this group of items together?" This is where you inject diversity so the LLM gets multi-faceted coverage instead of five clones of the top hit.

millions  ──bi-encoder──▶  ~1,000  ──cross-encoder──▶  ~50  ──diversity──▶  top 5
 (recall)                 (precision)                      (coverage)      → LLM

The RAG ranking funnel: retrieval, scoring, and ordering

Most pipelines do stage 1 and call it done. Adding stage 2 is a big jump in answer quality. Stage 3 is what kills the redundancy bottleneck.

Diversity in the ordering stage: MMR vs VRSD

Two algorithms own this stage.

Maximal Marginal Relevance (MMR) — the standard

MMR builds the result set greedily. At each step it picks the candidate that maximizes a tunable balance between relevance to the query and novelty versus what you've already selected:

def mmr(query, candidates, k, lam=0.7):
    selected = []
    while len(selected) < k:
        best = max(
            (c for c in candidates if c not in selected),
            key=lambda c: lam * sim(c, query)
                          - (1 - lam) * max((sim(c, s) for s in selected), default=0),
        )
        selected.append(best)
    return selected

The λ knob is the whole story:

λ = 1 — pure relevance. You're back to plain similarity sort, redundancy and all.
λ = 0 — pure diversity. Maximally varied, often maximally irrelevant.
λ ≈ 0.7 — the working sweet spot: relevance-led, with a penalty that quietly drops the near-duplicates.

This isn't theory for me. The "Ask the blog" RAG that powers the search on this site runs MMR at λ = 0.6 — relevance-forward, but diverse enough that a question pulls in several distinct posts instead of three chunks from the same one. The knob is real and you will tune it.

VRSD — the parameter-free challenger

MMR's weakness is that λ: someone has to pick it, and the right value drifts by corpus and query. VRSD (Vector Retrieval with Similarity and Diversity) removes the dial. Instead of scoring items one at a time, it maximizes the similarity between the query and the sum of all selected vectors.

The intuition is the famous king − man + woman ≈ queen arithmetic. If the combined "sum vector" of your selection has to point at the query, then each chosen vector has to approach the query from a different direction to cover it jointly — which geometrically forces diversity without a tuning parameter. You get joint semantic coverage for free, at the cost of a less familiar objective.

Practical read: reach for MMR when you want a known quantity and don't mind tuning one number; watch VRSD when per-corpus λ tuning becomes a maintenance tax.

What "good" looks like: benchmark before you ship

Diversity and rerankers are pointless if you can't tell whether they helped. Set targets, then measure against them:

Standard enterprise apps: aim for ≥ 0.80 on Faithfulness and Context Recall.
High-risk domains (healthcare, legal, compliance): a 20% error margin is unacceptable — target ≥ 0.95. The cost of a confident wrong answer is the whole reason the system needs to be measurable.

And use the right tool at the right stage instead of one framework everywhere:

RAGAS — fast metrics for R&D, where you're iterating on chunking and top_k.
DeepEval — assertion-style evals that drop into CI/CD, so a regression fails the build instead of the customer.
Patronus AI — real-time monitoring once you're live, catching the drift that offline tests never see.

The one-line version

A toy RAG retrieves. A production RAG measures what it retrieves, reranks it properly, and refuses to hand the model five copies of the same fact. The model was never the hard part — the pipeline around it is. Split your evals into retrieval and generation, add the cross-encoder and the diversity stage, and put a number on "good" before you call it done.