What This Module Covers
RAG Quality Engineering: You have a working RAG pipeline. Now you need to make it good. This module covers the techniques that separate a demo from a production system: diagnosing why retrieval fails, fixing it with pre-retrieval query improvements, adding a reranker for precision, and using HyDE for semantically difficult queries.
- Failure modes — the 5 most common reasons RAG retrieval returns wrong or irrelevant chunks
- Pre-retrieval improvements — query rewriting, multi-query expansion, step-back prompting
- Reranking with Cohere — a cross-encoder that re-scores your top-K results for precision
- HyDE — Hypothetical Document Embeddings for queries that don't match document language
- MMR — Maximum Marginal Relevance to reduce redundancy in retrieved chunks
- Evaluation metrics — MRR, NDCG, Hit Rate — measuring retrieval quality systematically
The 5 Most Common RAG Retrieval Failures
Diagnose First: Before applying fixes, diagnose which failure mode you have. Each requires a different solution.
Vocabulary Mismatch
User asks "how do I make packets go faster?" — docs say "throughput optimisation". Embedding similarity is low despite identical meaning.
Semantic Drift
Correct chunk is retrieved at rank 8 but you only return top-3. The answer exists but doesn't rank high enough.
Answer Spans Chunks
The answer requires combining information from two chunks that were split at a paragraph boundary.
Redundant Retrieval
Top-5 chunks all say the same thing from slightly different angles. The LLM gets no diverse context.
Missing From the Index
The expected document was never ingested, or chunking split it at a granularity that can never match the query. No retrieval-time fix helps; the problem is upstream in ingestion or chunking.
# Diagnostic checklist — run this before adding complexity
def diagnose_retrieval(query: str, collection, expected_source: str = None):
    # 1. Retrieve top-20 instead of top-5
    results = collection.query(
        query_texts=[query],
        n_results=20,
        include=["documents", "distances", "metadatas"],
    )
    docs = results["documents"][0]
    dists = results["distances"][0]
    print(f"Top-20 similarity scores: {[round(1-d, 3) for d in dists]}")
    # If correct chunk is rank 8+: semantic drift → reranker
    # If all scores < 0.5: vocabulary mismatch → HyDE or query rewrite
    # If scores are clustered (0.82, 0.81, 0.80...): redundancy → MMR

    # 2. Check if expected chunk exists at all
    if expected_source:
        found = any(expected_source in m.get("source", "")
                    for m in results["metadatas"][0])
        print(f"Expected source in top-20: {found}")
        # If False and you know the doc exists: chunk too large/small → rechunk
        # If False because doc not indexed: ingestion bug
Query Rewriting — Fix Vocabulary Mismatch
Pre-Retrieval
# Query rewriting: an LLM transforms the user query to better match document language
import anthropic

client = anthropic.Anthropic()  # ANTHROPIC_API_KEY from environment

REWRITE_PROMPT = """Rewrite the following user question to be more likely to match technical documentation. Make it precise and use domain terminology. Output only the rewritten question, nothing else.

User question: {query}

Rewritten:"""

def rewrite_query(query: str) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": REWRITE_PROMPT.format(query=query)}],
    )
    return response.content[0].text.strip()

# "how do I make packets go faster?" →
# "methods to improve packet processing throughput and reduce latency in DPDK"
Multi-Query Expansion — Cast a Wider Net
Recall Boost
# Generate multiple query variants → retrieve for each → merge and deduplicate
MULTI_QUERY_PROMPT = """Generate {n} different search queries that all ask about the same topic from different angles. The queries will be used to search technical documentation.

Original query: {query}

Output only the queries, one per line, numbered 1-{n}:"""

def multi_query_retrieve(query: str, collection, n_variants: int = 3, n_results: int = 5) -> list[dict]:
    # Generate query variants
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{"role": "user", "content": MULTI_QUERY_PROMPT.format(query=query, n=n_variants)}],
    )
    lines = response.content[0].text.strip().split("\n")
    queries = [query]  # include original
    for line in lines:
        q = line.lstrip("0123456789. ").strip()
        if q:
            queries.append(q)

    # Retrieve for each query, merge results by ID
    # (ChromaDB always returns ids — do not list "ids" in `include`)
    seen_ids = set()
    all_results = []
    for q in queries:
        results = collection.query(
            query_texts=[q], n_results=n_results,
            include=["documents", "distances", "metadatas"],
        )
        for doc, dist, meta, id_ in zip(
            results["documents"][0], results["distances"][0],
            results["metadatas"][0], results["ids"][0],
        ):
            if id_ not in seen_ids:
                seen_ids.add(id_)
                all_results.append({"text": doc, "score": 1 - dist, "meta": meta})

    # Return all unique results sorted by score; slice to top-K at the call site
    return sorted(all_results, key=lambda x: x["score"], reverse=True)
💡 Multi-query expansion is one of the cheapest quality improvements. 3-4 Haiku calls cost ~$0.001 and dramatically improve recall — especially when users phrase queries very differently from how your documents are written. LangChain ships a MultiQueryRetriever that implements this pattern.
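Back-of-envelope check on that cost claim, assuming Claude 3 Haiku's 2024 list prices of $0.25 per million input tokens and $1.25 per million output tokens (verify current pricing; the token counts are rough assumptions):

# Illustrative cost check: 4 short Haiku calls per user query
PRICE_IN, PRICE_OUT = 0.25, 1.25        # $ per million tokens (Claude 3 Haiku, 2024)
calls, tok_in, tok_out = 4, 150, 50     # rough per-call token assumptions
cost = calls * (tok_in * PRICE_IN + tok_out * PRICE_OUT) / 1_000_000
print(f"~${cost:.5f} per query")        # ≈ $0.0004, comfortably within ~$0.001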
Step-Back Prompting — Abstract Before Searching
Concept Shift
# Step-back: ask a more general question first, retrieve those chunks,
# then use them as context for the specific question
#
# Original:  "What is the max burst size for rte_ring_enqueue_burst?"
# Step-back: "How do DPDK ring buffer enqueue operations work?"
# → retrieves conceptual overview → LLM can reason to the specific answer
import asyncio
import anthropic

async_client = anthropic.AsyncAnthropic()

STEPBACK_PROMPT = """Given this specific question, write a more general version that asks about the underlying concept or principle.

Specific: {query}

General:"""

async def step_back_retrieve(query: str, collection) -> list[dict]:
    # Generate step-back query
    response = await async_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=80,
        messages=[{"role": "user", "content": STEPBACK_PROMPT.format(query=query)}],
    )
    abstract_query = response.content[0].text.strip()

    # Retrieve for both queries concurrently
    # (async_retrieve: your async wrapper around collection.query, not shown here)
    specific_task = asyncio.create_task(async_retrieve(query, collection))
    abstract_task = asyncio.create_task(async_retrieve(abstract_query, collection))
    specific, abstract = await asyncio.gather(specific_task, abstract_task)

    # Combine: abstract provides background, specific provides targeted answer
    return abstract[:2] + specific[:3]  # 2 background + 3 specific
Reranking — Two-Stage Retrieval for Precision
Biggest Quality Jump: The single biggest retrieval quality improvement in most RAG systems. Embeddings are fast but approximate — they measure general semantic similarity. A reranker is a cross-encoder that reads the query AND the chunk together for a more precise relevance score.
# Two-stage retrieval:
#   Stage 1 — Retrieve: fast embedding search, get top-50
#   Stage 2 — Rerank: cross-encoder scores each of the 50 precisely
#   Return top-5 of the reranked 50
#
# Why not use the cross-encoder for all 50,000 chunks?
# Cross-encoders are ~100x slower — fine for 50, too slow for 50,000

# pip install cohere
import cohere

co = cohere.Client()  # API key read from the environment

def retrieve_and_rerank(
    query: str,
    collection,
    retrieve_k: int = 50,  # retrieve many
    return_k: int = 5,     # return few, best ones
) -> list[dict]:
    # Stage 1: fast vector search
    results = collection.query(
        query_texts=[query],
        n_results=retrieve_k,
        include=["documents", "metadatas"],
    )
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    if not docs:
        return []

    # Stage 2: Cohere reranker
    rerank_response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=docs,
        top_n=return_k,
        return_documents=True,
    )
    return [
        {
            "text": hit.document.text,
            "score": hit.relevance_score,  # 0-1, higher = more relevant
            "rank": hit.index,             # original rank before reranking
            "meta": metas[hit.index],
        }
        for hit in rerank_response.results
    ]

# Usage
results = retrieve_and_rerank("How does DPDK mempool work?", collection)
for r in results:
    print(f"Score: {r['score']:.3f} (was rank {r['rank']+1}) | {r['text'][:60]}")
💡 Reranking typically improves precision@5 by 15-30%. The key insight is that the embedding model ranks by general semantic similarity, but the reranker asks "given THIS query, how relevant is THIS specific chunk?" — a much harder and more accurate question. Cohere's rerank-english-v3.0 is among the strongest commercially available cross-encoders as of 2024.
Free Reranking — Cross-Encoders with sentence-transformers
No API Cost
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

# Free cross-encoder models (smaller than Cohere's, still effective):
#   cross-encoder/ms-marco-MiniLM-L-6-v2  — fastest, reasonable quality
#   cross-encoder/ms-marco-MiniLM-L-12-v2 — better quality, slower
#   cross-encoder/ms-marco-electra-base   — best free quality
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_local(query: str, docs: list[str], top_k: int = 5) -> list[tuple]:
    """Returns (score, doc) pairs sorted by relevance."""
    pairs = [(query, doc) for doc in docs]
    scores = reranker.predict(pairs)
    # Sort on the score only, so tied scores don't fall back to comparing docs
    ranked = sorted(zip(scores, docs), key=lambda x: x[0], reverse=True)
    return ranked[:top_k]

# Use in a two-stage pipeline
stage1_docs = [r["text"] for r in stage1_results]
reranked = rerank_local(query, stage1_docs, top_k=5)
final_chunks = [doc for score, doc in reranked]
HyDE — Hypothetical Document Embeddings
Vocabulary Bridge: HyDE solves vocabulary mismatch by generating a hypothetical document that would answer the query, then searching for real documents similar to that hypothetical. This works because the hypothetical uses the same vocabulary and style as your real documents.
# Standard search: embed query → find similar chunks
# Problem: "make packets go faster" != "throughput optimisation"
#
# HyDE search: generate a hypothetical document → embed that → find similar chunks
# "make packets go faster" → generates a paragraph using "throughput", "mbps", "pps"
# → now the embedding matches real doc language
HYDE_PROMPT = """Write a short technical document passage (2-3 sentences) that would directly answer the following question. Write as if you are an expert writing documentation. Use precise technical terminology.

Question: {query}

Technical passage:"""

def hyde_retrieve(query: str, collection, n_results: int = 5) -> list[dict]:
    # Step 1: generate hypothetical document
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{"role": "user", "content": HYDE_PROMPT.format(query=query)}],
    )
    hypothetical_doc = response.content[0].text.strip()

    # Step 2: embed the hypothetical doc and search
    results = collection.query(
        query_texts=[hypothetical_doc],  # ← key: search with the generated doc, not the original query
        n_results=n_results,
        include=["documents", "distances", "metadatas"],
    )
    return [
        {"text": doc, "score": 1 - dist, "meta": meta, "hypothetical": hypothetical_doc}
        for doc, dist, meta in zip(
            results["documents"][0], results["distances"][0], results["metadatas"][0]
        )
    ]

# Hybrid: retrieve with both the original query and HyDE, then merge
def hybrid_hyde(query: str, collection, n_results: int = 5) -> list[dict]:
    standard = collection.query(
        query_texts=[query], n_results=n_results,
        include=["documents", "distances", "metadatas"],
    )
    hyde_res = hyde_retrieve(query, collection, n_results=n_results)

    # Merge unique results; HyDE results are added first, so duplicates keep the HyDE score
    seen = set()
    merged = []
    for r in hyde_res:
        if r["text"] not in seen:
            seen.add(r["text"])
            merged.append(r)
    for doc, dist, meta in zip(
        standard["documents"][0], standard["distances"][0], standard["metadatas"][0]
    ):
        if doc not in seen:
            seen.add(doc)
            merged.append({"text": doc, "score": 1 - dist, "meta": meta})
    return sorted(merged, key=lambda x: x["score"], reverse=True)[:n_results]
⚠️ HyDE adds hallucination risk. If the hypothetical document is factually wrong, you retrieve chunks similar to wrong information. Always use HyDE as an additional retrieval path (hybrid), never as the sole retrieval method. Rerank afterwards to surface the truly relevant chunks.
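Putting that warning into practice: a minimal sketch that chains the hybrid_hyde and rerank_local helpers defined above. The wiring and the over-fetch size of 20 are assumptions, not a prescribed pipeline.

# Hybrid HyDE + rerank: use HyDE as one retrieval path, then let a
# cross-encoder decide what actually survives into the context window
def hyde_then_rerank(query: str, collection, top_k: int = 5) -> list[str]:
    # Over-retrieve via the hybrid path (original query + hypothetical doc)
    candidates = hybrid_hyde(query, collection, n_results=20)
    docs = [c["text"] for c in candidates]
    # Rerank against the ORIGINAL query — this demotes chunks that only
    # matched the hypothetical document's (possibly hallucinated) wording
    reranked = rerank_local(query, docs, top_k=top_k)
    return [doc for score, doc in reranked]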
MMR — Maximum Marginal Relevance
Diversity: MMR balances relevance and diversity. Without it, your top-5 chunks might all be near-identical paragraphs from the same section. MMR ensures each selected chunk adds new information.
import numpy as np
def cosine_sim(a, b):
a, b = np.array(a), np.array(b)
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
def mmr(
query_vec: list[float],
candidate_vecs: list[list[float]],
candidate_docs: list[str],
top_k: int = 5,
lambda_param: float = 0.5 # 0=max diversity, 1=max relevance
) -> list[str]:
"""
Maximum Marginal Relevance selection.
Iteratively picks the candidate that maximises:
lambda * similarity(query, doc) - (1-lambda) * max_similarity(doc, selected)
"""
selected_idx = []
selected_vecs = []
remaining_idx = list(range(len(candidate_docs)))
for _ in range(min(top_k, len(candidate_docs))):
        best_score, best_idx = float("-inf"), -1  # -inf so even all-negative scores pick a candidate
for idx in remaining_idx:
relevance = cosine_sim(query_vec, candidate_vecs[idx])
if not selected_vecs:
redundancy = 0
else:
redundancy = max(cosine_sim(candidate_vecs[idx], sv)
for sv in selected_vecs)
score = lambda_param * relevance - (1 - lambda_param) * redundancy
if score > best_score:
best_score, best_idx = score, idx
selected_idx.append(best_idx)
selected_vecs.append(candidate_vecs[best_idx])
remaining_idx.remove(best_idx)
return [candidate_docs[i] for i in selected_idx]
# Practical example
# 1. Retrieve top-20 with embeddings
# 2. Apply MMR to select 5 diverse chunks
# 3. Pass to LLM — it now has diverse context, not 5 copies of the same info

💡 lambda_param tuning: For factual Q&A where precision matters, use lambda=0.7 (favour relevance). For open-ended research questions where you want broad coverage, use lambda=0.3 (favour diversity). ChromaDB's query() does not natively support MMR — implement it as a post-retrieval step on the returned vectors, as in the sketch below.
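A minimal sketch of that post-retrieval step, assuming the collection was built with ChromaDB's default embedding function (if you used a different embedding model, embed the query with that model instead; mmr_retrieve and its parameters are illustrative):

from chromadb.utils import embedding_functions

# The query must be embedded with the SAME model as the stored documents
ef = embedding_functions.DefaultEmbeddingFunction()

def mmr_retrieve(query: str, collection, fetch_k: int = 20, top_k: int = 5) -> list[str]:
    # 1. Over-retrieve, asking Chroma to return the stored vectors too
    results = collection.query(
        query_texts=[query],
        n_results=fetch_k,
        include=["documents", "embeddings"],
    )
    docs = results["documents"][0]
    vecs = results["embeddings"][0]
    # 2. Embed the query and run MMR over the candidates
    query_vec = ef([query])[0]
    return mmr(query_vec, vecs, docs, top_k=top_k, lambda_param=0.7)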
Measuring Retrieval Quality — MRR, Hit Rate, NDCG
Systematic Evaluation
# Build a test set: queries + expected source chunks
test_set = [
    {"query": "How does DPDK mempool initialisation work?",
     "expected_source": "dpdk_guide.pdf", "expected_section": "Memory Management"},
    {"query": "What is the rte_ring burst size limit?",
     "expected_source": "dpdk_guide.pdf", "expected_section": "Ring Library"},
    # ... 20+ test cases
]

def hit_rate(results: list[dict], expected_source: str, k: int = 5) -> float:
    """1 if the expected source appears in the top-k, else 0."""
    top_k = results[:k]
    return 1.0 if any(expected_source in r["meta"].get("source", "") for r in top_k) else 0.0

def mrr(results: list[dict], expected_source: str) -> float:
    """Reciprocal rank per query — averaged into MRR in evaluate_pipeline."""
    for i, r in enumerate(results):
        if expected_source in r["meta"].get("source", ""):
            return 1.0 / (i + 1)
    return 0.0

def evaluate_pipeline(retrieval_fn, test_set: list[dict], k: int = 5) -> dict:
    hit_rates, mrrs = [], []
    for test in test_set:
        results = retrieval_fn(test["query"])
        hit_rates.append(hit_rate(results, test["expected_source"], k))
        mrrs.append(mrr(results, test["expected_source"]))
    return {
        f"hit_rate@{k}": round(sum(hit_rates) / len(hit_rates), 3),
        "mrr": round(sum(mrrs) / len(mrrs), 3),
        "n_queries": len(test_set),
    }

# Compare pipelines (basic_retrieve is your plain top-5 vector search)
baseline = evaluate_pipeline(lambda q: basic_retrieve(q), test_set)
reranked = evaluate_pipeline(lambda q: retrieve_and_rerank(q, collection), test_set)
hyde_res = evaluate_pipeline(lambda q: hyde_retrieve(q, collection), test_set)
print(f"Baseline: {baseline}")  # {"hit_rate@5": 0.65, "mrr": 0.48}
print(f"Reranked: {reranked}")  # {"hit_rate@5": 0.82, "mrr": 0.67}
print(f"HyDE:     {hyde_res}")  # {"hit_rate@5": 0.74, "mrr": 0.55}
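The heading also promises NDCG. With a test set like the one above (binary relevance, one expected source per query), the ideal DCG is 1.0, so NDCG@k collapses to a rank-discounted hit score; a minimal sketch under that assumption:

import math

def ndcg(results: list[dict], expected_source: str, k: int = 5) -> float:
    """NDCG@k with binary relevance and a single relevant source per query."""
    for i, r in enumerate(results[:k]):
        if expected_source in r["meta"].get("source", ""):
            return 1.0 / math.log2(i + 2)  # rank 1 → 1.0, rank 2 → 0.63, ...
    return 0.0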
FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Docs | Cohere Reranking Guide — docs.cohere.com/docs/reranking-with-cohere | Official Cohere reranker documentation with API reference and best practices. |
| Guide | Pinecone: Improving Retrieval Quality — pinecone.io/learn | Practical guide covering common RAG failure modes and fixes including reranking and HyDE. |
| Docs | LangChain: Query Transformations — python.langchain.com | Query rewriting, step-back prompting, and HyDE implementation in LangChain. |
| Article | Anthropic: Contextual Retrieval — anthropic.com | Covers BM25 hybrid search + reranking combination for best retrieval quality. |
Milestone Project — Benchmark 4 Retrieval Pipelines
Build and benchmark 4 retrieval pipelines on the same document collection and test set. This is the experiment you would run before choosing a retrieval strategy for production.
Requirements
- Use the document collection from M16. Write a 20-question test set with ground truth sources.
- Pipeline 1 — Baseline: simple vector search, top-5
- Pipeline 2 — Multi-query: 3 query variants, deduplicated results
- Pipeline 3 — Reranked: top-50 vector search → Cohere rerank → top-5
- Pipeline 4 — HyDE + Rerank: hypothetical doc search → Cohere rerank → top-5
- Evaluate all 4 on hit_rate@5, mrr, and avg query latency (a timing harness sketch follows this list)
- Present findings: which pipeline wins? What is the cost per query for each?
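For the latency requirement, a minimal timing harness is enough. The pipeline functions named here are the sketches from earlier in this module (hyde_then_rerank is the hybrid from the HyDE section); substitute your own implementations:

import time

def avg_latency(retrieval_fn, test_set: list[dict]) -> float:
    """Average wall-clock seconds per query across the test set."""
    start = time.perf_counter()
    for test in test_set:
        retrieval_fn(test["query"])
    return (time.perf_counter() - start) / len(test_set)

pipelines = {
    "baseline":    lambda q: basic_retrieve(q),
    "multi_query": lambda q: multi_query_retrieve(q, collection),
    "reranked":    lambda q: retrieve_and_rerank(q, collection),
    "hyde_rerank": lambda q: hyde_then_rerank(q, collection),
}
for name, fn in pipelines.items():
    print(f"{name}: {avg_latency(fn, test_set):.2f}s/query")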
Skills: Cohere reranker, multi-query expansion, HyDE, MRR/hit-rate evaluation, cost analysis
Lab 1 — Diagnose a Failing RAG System
Objective: Apply the diagnostic framework to identify which failure mode you have — before guessing at fixes.
Lab 2 — Reranker: Measure the Precision Jump
Objective: Quantify exactly how much reranking improves precision on your specific document collection.
Lab 3 — HyDE vs Standard: When Does It Help?
Objective: Identify which types of queries benefit most from HyDE.
P5-M17 MASTERY CHECKLIST
- Can name the 5 RAG retrieval failure modes and identify which one is occurring from diagnostic signals
- Know that uniformly low similarity scores (e.g. all below ~0.5) typically signal vocabulary mismatch, not bad chunking
- Can implement query rewriting using a cheap LLM to match document vocabulary
- Can implement multi-query expansion: generate N variants, retrieve for each, deduplicate by ID
- Can implement step-back prompting: abstract the query first, retrieve background context
- Understand two-stage retrieval: retrieve large K with embeddings, rerank to small K with cross-encoder
- Can implement Cohere reranking with retrieve_k=50 and return_k=5
- Can implement free local reranking with sentence-transformers CrossEncoder
- Know that rerankers read query+chunk together (cross-encoder) vs embeddings which are independent (bi-encoder)
- Can explain HyDE: generate hypothetical document → embed it → search for real similar documents
- Know that HyDE adds hallucination risk and should always be used as a hybrid, not sole retrieval path
- Can implement MMR to reduce redundancy in retrieved chunks
- Know lambda_param for MMR: 0.7 for factual Q&A (relevance), 0.3 for research (diversity)
- Can build a test set with ground truth and compute hit_rate@K and MRR
- Can run an evaluation comparing multiple retrieval pipelines and make a cost/quality trade-off decision
- Completed Lab 1: diagnosed failing RAG system with failure mode classification
- Completed Lab 2: reranker precision measurement
- Completed Lab 3: HyDE vs standard query type analysis
- Milestone project: 4-pipeline benchmark pushed to GitHub with findings
✅ When complete: Move to P5-M18 — RAG Pipelines, Grounding & Hallucination Reduction. You now have excellent retrieval. M18 covers combining retrieval with LLM generation into a complete, production-grade RAG system.