Part 5 — RAG Systems  ·  Module 17 of 18
Retrieval Quality
Filtering, reranking, query expansion and diagnosing why your RAG retrieval is failing
⏱ 1 Week 🟡 Intermediate 🔧 Cohere Reranker · HyDE · MMR 📋 Prerequisite: P5-M16
🎯

What This Module Covers

RAG Quality Engineering

You have a working RAG pipeline. Now you need to make it good. This module covers the techniques that separate a demo from a production system: diagnosing why retrieval fails, fixing it with pre-retrieval query improvements, adding a reranker for precision, and using HyDE for semantically difficult queries.

  • Failure modes — the 5 most common reasons RAG retrieval returns wrong or irrelevant chunks
  • Pre-retrieval improvements — query rewriting, multi-query expansion, step-back prompting
  • Reranking with Cohere — a cross-encoder that re-scores your top-K results for precision
  • HyDE — Hypothetical Document Embeddings for queries that don't match document language
  • MMR — Maximum Marginal Relevance to reduce redundancy in retrieved chunks
  • Evaluation metrics — MRR, NDCG, Hit Rate — measuring retrieval quality systematically
🚨

The 5 Most Common RAG Retrieval Failures

Diagnose First

Before applying fixes, diagnose which failure mode you have. Each requires a different solution.

Vocabulary Mismatch

User asks "how do I make packets go faster?" — docs say "throughput optimisation". Embedding similarity is low despite identical meaning.

Fix: HyDE, query rewriting, synonym expansion

Semantic Drift

Correct chunk is retrieved at rank 8 but you only return top-3. The answer exists but doesn't rank high enough.

Fix: larger top-K then rerank, better chunk size

Answer Spans Chunks

The answer requires combining information from two chunks that were split at a paragraph boundary.

Fix: increase overlap, larger chunks, parent-child chunking

Redundant Retrieval

Top-5 chunks all say the same thing from slightly different angles. The LLM gets no diverse context.

Fix: MMR (Maximum Marginal Relevance) diversity

Wrong Scope Retrieved

Query is about v2.0 of an API but retrieves chunks from v1.0, which has the same section names.

Fix: metadata filtering on version, date, source
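The wrong-scope failure is fixed at query time with a metadata filter rather than any change to the retriever. A minimal sketch using ChromaDB's where argument, assuming the chunks were ingested with a version field in their metadata (the field name and query text are placeholders):

results = collection.query(
    query_texts=["How do I configure rate limiting?"],
    n_results=5,
    where={"version": "2.0"},              # only search chunks tagged as v2.0 docs
    include=["documents", "metadatas"]
)
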
# Diagnostic checklist — run this before adding complexity
def diagnose_retrieval(query: str, collection, expected_source: str | None = None):
    # 1. Retrieve top-20 instead of top-5
    results = collection.query(query_texts=[query], n_results=20,
                               include=["documents", "distances", "metadatas"])
    docs  = results["documents"][0]
    dists = results["distances"][0]

    print(f"Top-20 similarity scores: {[round(1-d,3) for d in dists]}")
    # If correct chunk is rank 8+: semantic drift → reranker
    # If all scores < 0.5: vocabulary mismatch → HyDE or query rewrite
    # If scores are clustered (0.82, 0.81, 0.80...): redundancy → MMR

    # 2. Check if expected chunk exists at all
    if expected_source:
        found = any(expected_source in m.get("source", "")
                    for m in results["metadatas"][0])
        print(f"Expected source in top-20: {found}")
        # If False and you know the doc exists: chunk too large/small → rechunk
        # If False because doc not indexed: ingestion bug
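A quick usage sketch (the query and expected source are placeholders; the 1 - distance scores printed above assume the collection was created with cosine distance):

diagnose_retrieval("How does DPDK mempool work?", collection,
                   expected_source="dpdk_guide.pdf")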
🔍

Query Rewriting — Fix Vocabulary Mismatch

Pre-Retrieval
# Query rewriting: LLM transforms user query to better match document language
REWRITE_PROMPT = """Rewrite the following user question to be more likely to
match technical documentation. Make it precise and use domain terminology.
Output only the rewritten question, nothing else.

User question: {query}

Rewritten:"""

def rewrite_query(query: str) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user",
                   "content": REWRITE_PROMPT.format(query=query)}]
    )
    return response.content[0].text.strip()

# "how do I make packets go faster?" →
# "methods to improve packet processing throughput and reduce latency in DPDK"
📋

Multi-Query Expansion — Cast a Wider Net

Recall Boost
# Generate multiple query variants → retrieve for each → merge and deduplicate
MULTI_QUERY_PROMPT = """Generate {n} different search queries that all ask about
the same topic from different angles. The queries will be used to search
technical documentation.

Original query: {query}

Output only the queries, one per line, numbered 1-{n}:"""

def multi_query_retrieve(query: str, collection, n_variants: int = 3,
                          n_results: int = 5) -> list[dict]:
    # Generate query variants
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{"role": "user",
                   "content": MULTI_QUERY_PROMPT.format(query=query, n=n_variants)}]
    )
    lines = response.content[0].text.strip().split("\n")
    queries = [query]  # include original
    for line in lines:
        q = line.lstrip("0123456789. ").strip()
        if q:
            queries.append(q)

    # Retrieve for each query, merge results by ID
    seen_ids = set()
    all_results = []
    for q in queries:
        # Note: ChromaDB always returns IDs, so "ids" must not be listed in `include`
        results = collection.query(query_texts=[q], n_results=n_results,
                                   include=["documents", "distances", "metadatas"])
        for doc, dist, meta, id_ in zip(
            results["documents"][0], results["distances"][0],
            results["metadatas"][0], results["ids"][0]
        ):
            if id_ not in seen_ids:
                seen_ids.add(id_)
                all_results.append({"text": doc, "score": 1-dist, "meta": meta})

    # Sort the merged, deduplicated results by score (slice to top-K at the call site)
    return sorted(all_results, key=lambda x: x["score"], reverse=True)

💡 Multi-query expansion is one of the cheapest quality improvements. 3-4 Haiku calls cost ~$0.001 and dramatically improve recall — especially when users phrase queries very differently from how your documents are written. LangChain ships a MultiQueryRetriever that implements this pattern.
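If you are already using LangChain, MultiQueryRetriever packages the same generate-variants-then-merge flow. A rough sketch, assuming a LangChain Chroma vector store (vectorstore) built from the M16 collection and the langchain-anthropic package; exact import paths and the invoke vs. get_relevant_documents call have shifted between LangChain versions:

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-3-haiku-20240307")
mq_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),  # vectorstore: assumed LangChain wrapper around the M16 collection
    llm=llm,
)
docs = mq_retriever.invoke("how do I make packets go faster?")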

⬆️

Step-Back Prompting — Abstract Before Searching

Concept Shift
# Step-back: ask a more general question first, retrieve those chunks,
# then use them as context for the specific question
#
# Original: "What is the max burst size for rte_ring_enqueue_burst?"
# Step-back: "How do DPDK ring buffer enqueue operations work?"
# → retrieves conceptual overview → LLM can reason to the specific answer

STEPBACK_PROMPT = """Given this specific question, write a more general version
that asks about the underlying concept or principle.

Specific: {query}

General:"""

import asyncio

async def step_back_retrieve(query: str, collection) -> list[dict]:
    # Generate step-back query (async_client is anthropic.AsyncAnthropic(),
    # the async counterpart of the client used in the snippets above)
    response = await async_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=80,
        messages=[{"role": "user", "content": STEPBACK_PROMPT.format(query=query)}]
    )
    abstract_query = response.content[0].text.strip()

    # Retrieve for both queries concurrently
    specific_task  = asyncio.create_task(async_retrieve(query,          collection))
    abstract_task  = asyncio.create_task(async_retrieve(abstract_query, collection))
    specific, abstract = await asyncio.gather(specific_task, abstract_task)

    # Combine: abstract provides background, specific provides targeted answer
    return abstract[:2] + specific[:3]   # 2 background + 3 specific
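The snippet above leans on an async_retrieve helper that isn't shown. A minimal sketch, assuming the ChromaDB collection from M16 and cosine distance; Chroma's client is synchronous, so the blocking query is pushed to a worker thread:

import asyncio

async def async_retrieve(query: str, collection, n_results: int = 5) -> list[dict]:
    def _query():
        results = collection.query(query_texts=[query], n_results=n_results,
                                   include=["documents", "distances", "metadatas"])
        return [{"text": doc, "score": 1 - dist, "meta": meta}
                for doc, dist, meta in zip(results["documents"][0],
                                           results["distances"][0],
                                           results["metadatas"][0])]
    # Run the blocking ChromaDB call off the event loop
    return await asyncio.to_thread(_query)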
📊

Reranking — Two-Stage Retrieval for Precision

Biggest Quality Jump

Reranking is the single biggest retrieval quality improvement in most RAG systems. Embeddings are fast but approximate — they measure general semantic similarity. A reranker is a cross-encoder that reads the query AND the chunk together to produce a more precise relevance score.

# Two-stage retrieval:
# Stage 1 — Retrieve: fast embedding search, get top-50
# Stage 2 — Rerank:   cross-encoder scores each of the 50 precisely
# Return top-5 of the reranked 50
#
# Why not use the cross-encoder for all 50,000 chunks?
# Cross-encoders are ~100x slower — fine for 50, too slow for 50,000

pip install cohere

import cohere
co = cohere.Client()   # COHERE_API_KEY from environment

def retrieve_and_rerank(
    query: str,
    collection,
    retrieve_k: int = 50,   # retrieve many
    return_k: int = 5       # return few, best ones
) -> list[dict]:
    # Stage 1: fast vector search
    results = collection.query(
        query_texts=[query], n_results=retrieve_k,
        include=["documents", "metadatas"]
    )
    docs  = results["documents"][0]
    metas = results["metadatas"][0]

    if not docs:
        return []

    # Stage 2: Cohere reranker
    rerank_response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=docs,
        top_n=return_k,
        return_documents=True
    )

    return [
        {
            "text":      hit.document.text,
            "score":     hit.relevance_score,    # 0-1, higher = more relevant
            "rank":      hit.index,              # original rank before reranking
            "meta":      metas[hit.index],
        }
        for hit in rerank_response.results
    ]

# Usage
results = retrieve_and_rerank("How does DPDK mempool work?", collection)
for r in results:
    print(f"Score: {r['score']:.3f} (was rank {r['rank']+1}) | {r['text'][:60]}")

💡 Reranking typically improves precision@5 by 15-30%. The key insight is that the embedding model ranks by general semantic similarity, but the reranker asks "given THIS query, how relevant is THIS specific chunk?" — a much harder and more accurate question. Cohere's rerank-english-v3.0 is among the strongest commercially available cross-encoders as of 2024.

🆓

Free Reranking — Cross-Encoders with sentence-transformers

No API Cost
pip install sentence-transformers

from sentence_transformers import CrossEncoder

# Free cross-encoder models (smaller than Cohere, still effective)
# ms-marco-MiniLM-L-6-v2 — fastest, reasonable quality
# ms-marco-MiniLM-L-12-v2 — better quality, slower
# cross-encoder/ms-marco-electra-base — best free quality

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_local(query: str, docs: list[str], top_k: int = 5) -> list[tuple]:
    """Returns (score, doc) pairs sorted by relevance."""
    pairs  = [(query, doc) for doc in docs]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, docs), reverse=True)
    return ranked[:top_k]

# Use in two-stage pipeline
stage1_docs  = [r["text"] for r in stage1_results]
reranked     = rerank_local(query, stage1_docs, top_k=5)
final_chunks = [doc for score, doc in reranked]
🌀

HyDE — Hypothetical Document Embeddings

Vocabulary Bridge

HyDE solves vocabulary mismatch by generating a hypothetical document that would answer the query, then searching for real documents similar to that hypothetical. This works because the hypothetical uses the same vocabulary and style as your real documents.

# Standard search: embed query → find similar chunks
# Problem: "make packets go faster" != "throughput optimisation"
#
# HyDE search: generate a hypothetical document → embed that → find similar chunks
# "make packets go faster" → generates paragraph using "throughput", "mbps", "pps"
# → now embedding matches real doc language

HYDE_PROMPT = """Write a short technical document passage (2-3 sentences) that would
directly answer the following question. Write as if you are an expert
writing documentation. Use precise technical terminology.

Question: {query}

Technical passage:"""

def hyde_retrieve(query: str, collection, n_results: int = 5) -> list[dict]:
    # Step 1: generate hypothetical document
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{"role": "user",
                   "content": HYDE_PROMPT.format(query=query)}]
    )
    hypothetical_doc = response.content[0].text.strip()

    # Step 2: embed the hypothetical doc and search
    results = collection.query(
        query_texts=[hypothetical_doc],   # ← key: search with generated doc, not original query
        n_results=n_results,
        include=["documents", "distances", "metadatas"]
    )
    return [
        {"text": doc, "score": 1-dist, "meta": meta,
         "hypothetical": hypothetical_doc}
        for doc, dist, meta in zip(
            results["documents"][0],
            results["distances"][0],
            results["metadatas"][0]
        )
    ]

# Hybrid: retrieve with both original query and HyDE, merge
def hybrid_hyde(query: str, collection, n_results: int = 5) -> list[dict]:
    standard = collection.query(query_texts=[query], n_results=n_results,
                                include=["documents", "distances", "metadatas"])
    hyde_res = hyde_retrieve(query, collection, n_results=n_results)

    # Merge unique results; original-query hits are added first, so they win
    # ties when the same chunk is retrieved by both paths
    seen = set()
    merged = []
    for doc, dist, meta in zip(standard["documents"][0],
                               standard["distances"][0],
                               standard["metadatas"][0]):
        if doc not in seen:
            seen.add(doc)
            merged.append({"text": doc, "score": 1-dist, "meta": meta})
    for r in hyde_res:
        if r["text"] not in seen:
            seen.add(r["text"])
            merged.append(r)
    return sorted(merged, key=lambda x: x["score"], reverse=True)[:n_results]

⚠️ HyDE adds hallucination risk. If the hypothetical document is factually wrong, you retrieve chunks similar to wrong information. Always use HyDE as an additional retrieval path (hybrid), never as the sole retrieval method. Rerank afterwards to surface the truly relevant chunks.
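Putting that advice into code: a small sketch that uses the hybrid_hyde function above as the additional retrieval path and the local rerank_local cross-encoder from earlier to pick the final chunks (the query is just an example):

query      = "how do I make packets go faster?"
candidates = hybrid_hyde(query, collection, n_results=20)   # standard + HyDE paths, merged
top_chunks = rerank_local(query, [c["text"] for c in candidates], top_k=5)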

🎯

MMR — Maximum Marginal Relevance

Diversity

MMR balances relevance and diversity. Without it, your top-5 chunks might all be near-identical paragraphs from the same section. MMR ensures each selected chunk adds new information.

import numpy as np

def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def mmr(
    query_vec: list[float],
    candidate_vecs: list[list[float]],
    candidate_docs: list[str],
    top_k: int = 5,
    lambda_param: float = 0.5   # 0=max diversity, 1=max relevance
) -> list[str]:
    """
    Maximum Marginal Relevance selection.
    Iteratively picks the candidate that maximises:
        lambda * similarity(query, doc) - (1-lambda) * max_similarity(doc, selected)
    """
    selected_idx   = []
    selected_vecs  = []
    remaining_idx  = list(range(len(candidate_docs)))

    for _ in range(min(top_k, len(candidate_docs))):
        best_score, best_idx = -1, -1

        for idx in remaining_idx:
            relevance = cosine_sim(query_vec, candidate_vecs[idx])

            if not selected_vecs:
                redundancy = 0
            else:
                redundancy = max(cosine_sim(candidate_vecs[idx], sv)
                                 for sv in selected_vecs)

            score = lambda_param * relevance - (1 - lambda_param) * redundancy
            if score > best_score:
                best_score, best_idx = score, idx

        selected_idx.append(best_idx)
        selected_vecs.append(candidate_vecs[best_idx])
        remaining_idx.remove(best_idx)

    return [candidate_docs[i] for i in selected_idx]

# Practical example
# 1. Retrieve top-20 with embeddings
# 2. Apply MMR to select 5 diverse chunks
# 3. Pass to LLM — it now has diverse context, not 5 copies of the same info

💡 lambda_param tuning: For factual Q&A where precision matters, use lambda=0.7 (favour relevance). For open-ended research questions where you want broad coverage, use lambda=0.3 (favour diversity). ChromaDB's query() does not natively support MMR — implement it as a post-retrieval step on the returned vectors.
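To run MMR as that post-retrieval step, you need the candidate embeddings back from Chroma plus an embedding for the query itself. A minimal sketch, assuming the mmr function above and a sentence-transformers model (all-MiniLM-L6-v2 is a placeholder; use whatever model the collection was built with):

from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("all-MiniLM-L6-v2")   # must match the collection's embedding model

def mmr_retrieve(query: str, collection, fetch_k: int = 20, top_k: int = 5,
                 lambda_param: float = 0.5) -> list[str]:
    # Fetch embeddings alongside documents so MMR can measure redundancy
    results = collection.query(query_texts=[query], n_results=fetch_k,
                               include=["documents", "embeddings"])
    docs = results["documents"][0]
    vecs = results["embeddings"][0]
    query_vec = embed_model.encode(query).tolist()
    return mmr(query_vec, vecs, docs, top_k=top_k, lambda_param=lambda_param)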

📏

Measuring Retrieval Quality — MRR, Hit Rate, NDCG

Systematic Evaluation
# Build a test set: queries + expected source chunks
test_set = [
    {"query": "How does DPDK mempool initialisation work?",
     "expected_source": "dpdk_guide.pdf",
     "expected_section": "Memory Management"},
    {"query": "What is the rte_ring burst size limit?",
     "expected_source": "dpdk_guide.pdf",
     "expected_section": "Ring Library"},
    # ... 20+ test cases
]

def hit_rate(results: list[dict], expected_source: str, k: int = 5) -> float:
    """1 if expected source appears in top-k, else 0."""
    top_k = results[:k]
    return 1.0 if any(expected_source in r["meta"].get("source", "")
                         for r in top_k) else 0.0

def mrr(results: list[dict], expected_source: str) -> float:
    """Reciprocal rank of the first relevant hit (1/rank); averaging over the test set gives MRR."""
    for i, r in enumerate(results):
        if expected_source in r["meta"].get("source", ""):
            return 1.0 / (i + 1)
    return 0.0

def evaluate_pipeline(retrieval_fn, test_set: list[dict], k: int = 5) -> dict:
    hit_rates, mrrs = [], []
    for test in test_set:
        results = retrieval_fn(test["query"])
        hit_rates.append(hit_rate(results, test["expected_source"], k))
        mrrs.append(mrr(results, test["expected_source"]))

    return {
        f"hit_rate@{k}": round(sum(hit_rates) / len(hit_rates), 3),
        "mrr":         round(sum(mrrs) / len(mrrs), 3),
        "n_queries":   len(test_set),
    }

# Compare pipelines
baseline = evaluate_pipeline(lambda q: basic_retrieve(q), test_set)
reranked = evaluate_pipeline(lambda q: retrieve_and_rerank(q, collection), test_set)
hyde_res = evaluate_pipeline(lambda q: hyde_retrieve(q, collection), test_set)

print(f"Baseline:  {baseline}")  # {"hit_rate@5": 0.65, "mrr": 0.48}
print(f"Reranked:  {reranked}")  # {"hit_rate@5": 0.82, "mrr": 0.67}
print(f"HyDE:      {hyde_res}")  # {"hit_rate@5": 0.74, "mrr": 0.55}

FREE LEARNING RESOURCES

  • Docs — Cohere Reranking Guide (docs.cohere.com/docs/reranking-with-cohere): official Cohere reranker documentation with API reference and best practices.
  • Guide — Pinecone: Improving Retrieval Quality (pinecone.io/learn): practical guide covering common RAG failure modes and fixes including reranking and HyDE.
  • Docs — LangChain: Query Transformations (python.langchain.com): query rewriting, step-back prompting, and HyDE implementation in LangChain.
  • Article — Anthropic: Contextual Retrieval (anthropic.com): covers the BM25 hybrid search + reranking combination for best retrieval quality.
🛠 Retrieval Quality Benchmark — 4 Pipelines Compared [Intermediate] 3–4 days

Build and benchmark 4 retrieval pipelines on the same document collection and test set. This is the experiment you would run before choosing a retrieval strategy for production.

Requirements

  • Use the document collection from M16. Write a 20-question test set with ground truth sources.
  • Pipeline 1 — Baseline: simple vector search, top-5
  • Pipeline 2 — Multi-query: 3 query variants, deduplicated results
  • Pipeline 3 — Reranked: top-50 vector search → Cohere rerank → top-5
  • Pipeline 4 — HyDE + Rerank: hypothetical doc search → Cohere rerank → top-5
  • Evaluate all 4 on hit_rate@5, mrr, and avg query latency
  • Present findings: which pipeline wins? What is the cost per query for each?

Skills: Cohere reranker, multi-query expansion, HyDE, MRR/hit-rate evaluation, cost analysis

LAB 1

Diagnose a Failing RAG System

Objective: Apply the diagnostic framework to identify which failure mode you have — before guessing at fixes.

1. Take your M16 collection. Find 3 queries where the baseline retrieval clearly fails (answer not in top-5). Log the failure for each.
2. For each failure, run the diagnostic: retrieve top-20, print similarity scores. Classify: vocabulary mismatch (<0.5 scores), semantic drift (correct at rank 8+), redundancy (scores clustered), span issue, or wrong scope.
3. Apply the matching fix for each failure mode. Verify the fix improved retrieval for that specific query.
4. Check for regressions: did the fix break any previously working queries? Document the trade-off.
LAB 2

Reranker — Measure the Precision Jump

Objective: Quantify exactly how much reranking improves precision on your specific document collection.

1. Write 15 test queries with ground truth sources. Run baseline (top-5 vector search). Score hit_rate@5 and MRR.
2. Run two-stage: top-50 vector search → Cohere rerank → top-5. Score the same metrics.
3. For queries where reranking changed the rank ordering significantly, inspect the before/after. Why did the reranker move those chunks up or down?
4. Calculate cost per query: embedding cost (stage 1) + reranking cost (stage 2). At what query volume does the cost become significant?
LAB 3

HyDE vs Standard — When Does It Help?

Objective: Identify which types of queries benefit most from HyDE.

1. Create 3 categories of test queries: (a) 5 queries using exact document vocabulary ("rte_mempool initialisation"), (b) 5 queries using layman language ("make memory faster"), (c) 5 conceptual queries ("why does DPDK avoid kernel").
2. Run standard retrieval and HyDE on all 15. Record hit_rate@5 per category for each method.
3. For each category, compare: standard vs HyDE. Which query type benefits most from HyDE?
4. Document your conclusion: when to activate HyDE, when to skip it (and why it adds unnecessary latency and cost for queries that already match well).

P5-M17 MASTERY CHECKLIST

When complete: Move to P5-M18 — RAG Pipelines, Grounding & Hallucination Reduction. You now have excellent retrieval. M18 covers combining retrieval with LLM generation into a complete, production-grade RAG system.
