What This Module Covers
This is the final module of Part 6. You cannot improve what you cannot measure. Evaluation is what separates teams that ship reliable AI systems from teams that rely on vibes and hope. This module covers the full evaluation stack — from simple metrics you implement yourself to production-grade harnesses.
- Key metrics — faithfulness, answer relevancy, context recall, task success rate, tool precision
- LLM-as-judge — using an LLM to evaluate LLM outputs, calibration, and known biases
- RAG evaluation — RAGAS framework: faithfulness, answer relevancy, context precision, context recall
- Agent evaluation — task success rate, tool call efficiency, trajectory accuracy
- DeepEval & Ragas — production eval frameworks with built-in metrics
- LangSmith — tracing, datasets, evaluation runs, regression testing
The Metrics That Matter
- RAG metrics — faithfulness, answer relevancy, context precision, context recall
- Agent metrics — task success rate, tool call precision, trajectory accuracy, cost per task
- General LLM metrics — output-quality checks such as hallucination and toxicity (see the DeepEval metrics below)
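For intuition, here is a simplified, set-based sketch of the retrieval and tool metrics. This is my own illustration: frameworks like Ragas and DeepEval compute these with LLM judges and rank-aware formulas, so treat it as a mental model only.

```python
def context_recall(retrieved_ids: set[str], relevant_ids: set[str]) -> float:
    # Of the chunks that should have been retrieved, how many actually were?
    return len(retrieved_ids & relevant_ids) / len(relevant_ids) if relevant_ids else 0.0

def context_precision(retrieved_ids: set[str], relevant_ids: set[str]) -> float:
    # Of the chunks that were retrieved, how many were actually relevant?
    return len(retrieved_ids & relevant_ids) / len(retrieved_ids) if retrieved_ids else 0.0

def tool_call_precision(tools_called: list[str], tools_needed: set[str]) -> float:
    # Of the tool calls the agent made, how many were needed for the task?
    return sum(t in tools_needed for t in tools_called) / len(tools_called) if tools_called else 0.0
```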
LLM-as-Judge — Using AI to Evaluate AI
When you can't write deterministic evaluation rules — which is the case for most LLM outputs — use a capable LLM as the judge. The key is calibration: your judge must agree with human raters on a validation set.
```python
import anthropic
import instructor
from pydantic import BaseModel

judge_client = instructor.from_anthropic(anthropic.Anthropic())

class JudgeVerdict(BaseModel):
    score: float     # 0.0 to 1.0
    reasoning: str
    passed: bool     # True if score >= threshold

# ── Faithfulness judge ────────────────────────────────
FAITHFULNESS_JUDGE = """You are an expert evaluator. Determine whether every factual
claim in the ANSWER is directly supported by the CONTEXT.

Score 1.0: All claims are explicitly stated in the context.
Score 0.5: Most claims supported, some extrapolation.
Score 0.0: Major claims not in context — hallucination present.

CONTEXT:
{context}

ANSWER:
{answer}"""

def judge_faithfulness(context: str, answer: str,
                       threshold: float = 0.7) -> JudgeVerdict:
    result = judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",  # strong model for judging
        max_tokens=300,
        temperature=0.0,
        messages=[{"role": "user",
                   "content": FAITHFULNESS_JUDGE.format(context=context, answer=answer)}],
        response_model=JudgeVerdict,
    )
    result.passed = result.score >= threshold
    return result

# ── Task success judge ────────────────────────────────
TASK_SUCCESS_JUDGE = """Did the AI agent successfully complete the following task?

ORIGINAL TASK: {task}

AGENT'S OUTPUT: {output}

Score 1.0: Task fully completed — all requirements met.
Score 0.5: Task partially completed — some requirements missing.
Score 0.0: Task failed — output does not address the task."""

def judge_task_success(task: str, output: str) -> JudgeVerdict:
    return judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200, temperature=0.0,
        messages=[{"role": "user",
                   "content": TASK_SUCCESS_JUDGE.format(task=task, output=output)}],
        response_model=JudgeVerdict,
    )
```

LLM Judge Biases — Know These
- Position bias — judges prefer the first option shown in A/B comparisons. Always randomise ordering and average results (see the sketch after this list).
- Verbosity bias — longer answers score higher even if less accurate. Penalise unnecessary length explicitly in your judge prompt.
- Self-preference bias — Claude tends to prefer Claude outputs, GPT prefers GPT outputs. Use a different model family as judge when evaluating your primary model.
- Sycophancy — judges rate answers higher if they seem confident. Include "do not be influenced by the confidence of the answer" in your judge prompt.
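A minimal sketch of the position-bias mitigation from the first bullet. The pairwise prompt and helper names are illustrative (not from any framework); it reuses judge_client and JudgeVerdict defined above, and the prompt also tells the judge to ignore length and confident tone, which blunts the verbosity and sycophancy biases.

```python
PAIRWISE_JUDGE = """Which answer better addresses the QUESTION?
Judge only accuracy and completeness; do not reward length or confident tone.

QUESTION: {question}

ANSWER A: {answer_a}

ANSWER B: {answer_b}

Score 1.0 if A is clearly better, 0.0 if B is clearly better, 0.5 if they are tied."""

def judge_pair_once(question: str, answer_a: str, answer_b: str) -> float:
    verdict = judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200, temperature=0.0,
        messages=[{"role": "user",
                   "content": PAIRWISE_JUDGE.format(question=question,
                                                    answer_a=answer_a, answer_b=answer_b)}],
        response_model=JudgeVerdict,
    )
    return verdict.score

def judge_pair_debiased(question: str, answer_1: str, answer_2: str) -> float:
    # Judge both orderings so neither answer benefits from being shown first,
    # then average. Returns a score for answer_1 (1.0 = clearly better).
    forward = judge_pair_once(question, answer_1, answer_2)          # answer_1 shown as A
    reversed_ = 1.0 - judge_pair_once(question, answer_2, answer_1)  # answer_1 shown as B
    return (forward + reversed_) / 2
```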
```python
# Calibrate your judge against human ratings
# Step 1: get 50 human-rated examples (your gold set)
# Step 2: run your judge on the same 50
# Step 3: compute correlation (Pearson r or Spearman ρ)
# Step 4: if r < 0.7, iterate on the judge prompt
from scipy.stats import pearsonr, spearmanr

def calibrate_judge(human_scores: list[float],
                    judge_scores: list[float]) -> dict:
    pearson_r, _ = pearsonr(human_scores, judge_scores)
    spearman_r, _ = spearmanr(human_scores, judge_scores)
    agreement = sum(1 for h, j in zip(human_scores, judge_scores)
                    if abs(h - j) < 0.2) / len(human_scores)
    return {
        "pearson_r": round(pearson_r, 3),
        "spearman_r": round(spearman_r, 3),
        "agreement": round(agreement, 3),
        "calibrated": pearson_r >= 0.7,  # r ≥ 0.7 considered acceptable
    }
```
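A sketch of how the calibration loop fits together, assuming a hand-rated gold set in the format shown (gold_set and its contents are placeholders, not real data):

```python
# Hypothetical gold set: ~50 examples you have rated by hand
gold_set = [
    {"context": "...", "answer": "...", "human_score": 1.0},
    {"context": "...", "answer": "...", "human_score": 0.0},
    # ... more examples
]

human_scores = [ex["human_score"] for ex in gold_set]
judge_scores = [judge_faithfulness(ex["context"], ex["answer"]).score for ex in gold_set]

report = calibrate_judge(human_scores, judge_scores)
if not report["calibrated"]:
    print(f"Judge disagrees with humans (r={report['pearson_r']}): revise the judge prompt")
```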
Evaluating Your RAG Pipeline End-to-End
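The eval loop below calls judge_answer_relevancy, which was not defined above. A minimal sketch following the same pattern as judge_faithfulness (the prompt wording is illustrative):

```python
ANSWER_RELEVANCY_JUDGE = """Does the ANSWER directly address the QUESTION?

Score 1.0: Fully addresses the question.
Score 0.5: Partially addresses it, or includes significant off-topic content.
Score 0.0: Does not address the question.

QUESTION:
{question}

ANSWER:
{answer}"""

def judge_answer_relevancy(question: str, answer: str,
                           threshold: float = 0.7) -> JudgeVerdict:
    result = judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=300, temperature=0.0,
        messages=[{"role": "user",
                   "content": ANSWER_RELEVANCY_JUDGE.format(question=question, answer=answer)}],
        response_model=JudgeVerdict,
    )
    result.passed = result.score >= threshold
    return result
```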
```python
# Build a ground truth dataset for RAG evaluation
# Each test case: question + expected answer + expected source
RAG_TEST_SET = [
    {
        "question": "How does DPDK mempool initialisation work?",
        "ground_truth": "DPDK mempool uses rte_mempool_create() with a fixed pool of memory objects pre-allocated at startup.",
        "expected_source": "dpdk-guide-mempool.pdf",
    },
    # ... 20+ test cases
]

# Run full eval loop
import time

async def evaluate_rag_pipeline(pipeline, test_set: list) -> dict:
    scores = {"faithfulness": [], "relevancy": [], "hit_rate": [],
              "cost_usd": [], "latency_ms": []}
    for case in test_set:
        t_start = time.perf_counter()
        result = await pipeline.query(case["question"])
        latency = (time.perf_counter() - t_start) * 1000

        # Metric 1: Faithfulness
        faith = judge_faithfulness(
            context=" ".join(s["text"] for s in result["sources"]),
            answer=result["answer"]
        )
        scores["faithfulness"].append(faith.score)

        # Metric 2: Answer relevancy (does answer address the question?)
        relevancy = judge_answer_relevancy(case["question"], result["answer"])
        scores["relevancy"].append(relevancy.score)

        # Metric 3: Source hit rate
        expected = case["expected_source"]
        hit = any(expected in s.get("source", "") for s in result["sources"])
        scores["hit_rate"].append(float(hit))

        scores["latency_ms"].append(latency)

    def avg(lst):
        return round(sum(lst) / len(lst), 3) if lst else 0

    return {k: avg(v) for k, v in scores.items()}
```

Evaluating Agents — Task Success and Trajectory
```python
from dataclasses import dataclass

# Agent evaluation is harder than RAG eval because:
# 1. The "right answer" may not be unique
# 2. The path matters, not just the destination
# 3. Tool calls have side effects that are hard to undo

@dataclass
class AgentTestCase:
    task: str
    expected_outcome: str               # what a successful completion looks like
    required_tools: list[str] = None    # tools that MUST be called
    forbidden_tools: list[str] = None   # tools that must NOT be called
    max_turns: int = 10
    max_cost_usd: float = 0.50

AGENT_TEST_SET = [
    AgentTestCase(
        task="Find the square root of 1764 and the current time",
        expected_outcome="Answer mentions 42 and current time",
        required_tools=["calculate", "get_current_time"],
        max_turns=5,
    ),
    AgentTestCase(
        task="Search for DPDK documentation on hugepages",
        expected_outcome="Returns information about hugepage configuration",
        required_tools=["search_web"],
        forbidden_tools=["send_email"],  # should not email anyone
    ),
]

class AgentEvaluator:
    def evaluate(self, agent_fn, test_case: AgentTestCase) -> dict:
        result = agent_fn(test_case.task)
        tools_called = result.get("tools_called", [])
        output = result.get("answer", "")
        turns = result.get("turns_used", 0)
        cost = result.get("cost_usd", 0)

        # Task success — LLM judge
        success = judge_task_success(test_case.task, output)

        # Required tools coverage
        tool_coverage = 1.0
        if test_case.required_tools:
            called_set = set(tools_called)
            required = set(test_case.required_tools)
            tool_coverage = len(called_set & required) / len(required)

        # Forbidden tools check
        forbidden_used = []
        if test_case.forbidden_tools:
            forbidden_used = [t for t in tools_called if t in test_case.forbidden_tools]

        # Efficiency: did it use more turns than needed?
        efficiency = min(1.0, (test_case.max_turns - turns) / test_case.max_turns + 0.5)

        return {
            "task_success": success.score,
            "task_passed": success.passed,
            "tool_coverage": tool_coverage,
            "forbidden_used": forbidden_used,
            "turns_used": turns,
            "cost_usd": cost,
            "efficiency": efficiency,
            "judge_reasoning": success.reasoning,
        }

    def evaluate_batch(self, agent_fn, test_set) -> dict:
        results = [self.evaluate(agent_fn, tc) for tc in test_set]
        successes = [r["task_success"] for r in results]
        return {
            "task_success_rate": sum(r["task_passed"] for r in results) / len(results),
            "avg_success_score": sum(successes) / len(successes),
            "avg_turns": sum(r["turns_used"] for r in results) / len(results),
            "avg_cost_usd": sum(r["cost_usd"] for r in results) / len(results),
            "forbidden_violations": sum(1 for r in results if r["forbidden_used"]),
            "n": len(results),
        }
```

DeepEval — Production Eval Framework
```python
# pip install deepeval
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric, FaithfulnessMetric,
    ContextualPrecisionMetric, ContextualRecallMetric,
    HallucinationMetric, ToxicityMetric,
)
from deepeval.test_case import LLMTestCase

# Define metrics
metrics = [
    AnswerRelevancyMetric(threshold=0.7, model="gpt-4o"),
    FaithfulnessMetric(threshold=0.7, model="gpt-4o"),
    ContextualPrecisionMetric(threshold=0.7, model="gpt-4o"),
    ContextualRecallMetric(threshold=0.7, model="gpt-4o"),
]

# Create a test case
test_case = LLMTestCase(
    input="How does DPDK mempool work?",
    actual_output="DPDK mempool pre-allocates a fixed pool of memory objects...",
    expected_output="rte_mempool_create() creates a fixed-size pool...",  # optional
    retrieval_context=["The DPDK mempool library provides an API to allocate..."],
)

# Run evaluation
results = evaluate([test_case], metrics)

# Use in pytest for CI/CD regression testing
from deepeval import assert_test
import pytest

# my_test_cases: your list of LLMTestCase objects
@pytest.mark.parametrize("test_case", my_test_cases)
def test_rag_quality(test_case):
    assert_test(test_case, metrics)
```

Ragas — RAG Assessment Framework
```python
# pip install ragas
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy, context_precision, context_recall
)
from datasets import Dataset

# Prepare your evaluation dataset
data = {
    "question": ["How does DPDK mempool work?", "What is VPP?"],
    "answer": ["DPDK mempool uses rte_mempool_create...",
               "VPP is Vector Packet Processor..."],
    "contexts": [
        ["The mempool library provides...", "rte_mempool_create allocates..."],
        ["VPP is FD.io's data plane..."],
    ],
    "ground_truth": ["rte_mempool_create creates a fixed pool",
                     "VPP processes vectors of packets"],
}
dataset = Dataset.from_dict(data)

# Run Ragas evaluation
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(result)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.84, 'context_recall': 0.79}

# Convert to pandas for analysis
df = result.to_pandas()
df.to_csv("rag_eval_results.csv", index=False)
# Identify lowest-scoring questions → improve retrieval or chunking for those
```

LangSmith — Tracing and Evaluation Platform
```python
# pip install langsmith
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = os.environ["LANGSMITH_API_KEY"]
os.environ["LANGCHAIN_PROJECT"] = "my-rag-project"
# All LangChain calls now auto-trace to LangSmith
# Go to smith.langchain.com → see every run

# ── Manual tracing (without LangChain) ───────────────
from langsmith import Client, traceable

ls_client = Client()

@traceable(name="rag_query", run_type="chain")
def traced_rag_query(question: str) -> dict:
    # Your RAG pipeline here — every call is auto-logged
    result = rag_query(question)
    return result

# ── Dataset-based evaluation ──────────────────────────
# Create a dataset in LangSmith
dataset = ls_client.create_dataset("rag-eval-set")
ls_client.create_examples(
    inputs=[{"question": t["question"]} for t in RAG_TEST_SET],
    outputs=[{"answer": t["ground_truth"]} for t in RAG_TEST_SET],
    dataset_id=dataset.id,
)

# Define evaluator function
def faithfulness_evaluator(run, example) -> dict:
    verdict = judge_faithfulness(
        context=run.outputs.get("context", ""),
        answer=run.outputs.get("answer", "")
    )
    return {"key": "faithfulness", "score": verdict.score, "comment": verdict.reasoning}

# Run evaluation against the dataset
from langsmith.evaluation import evaluate as ls_evaluate

results = ls_evaluate(
    traced_rag_query,
    data="rag-eval-set",
    evaluators=[faithfulness_evaluator],
    experiment_prefix="v2-reranked",
)
# Results visible in LangSmith UI with charts, per-example scores, diffs
```

💡 LangSmith's experiment comparison is its killer feature. Run your baseline (v1) and improved (v2) pipelines against the same dataset, and LangSmith shows a side-by-side diff of every metric. This is how you prove that a new reranker or chunking strategy improved quality without regressions.
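A sketch of the baseline-vs-improved comparison described in the tip, assuming rag_query_v1 and rag_query_v2 are two traced variants of your pipeline (both names are placeholders):

```python
from langsmith.evaluation import evaluate as ls_evaluate

baseline = ls_evaluate(
    rag_query_v1,                       # e.g. no reranker
    data="rag-eval-set",
    evaluators=[faithfulness_evaluator],
    experiment_prefix="v1-baseline",
)
improved = ls_evaluate(
    rag_query_v2,                       # e.g. with reranker
    data="rag-eval-set",
    evaluators=[faithfulness_evaluator],
    experiment_prefix="v2-reranked",
)
# Open both experiments in the LangSmith UI and compare them side by side
# to see per-example score differences between v1 and v2.
```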
FREE LEARNING RESOURCES
| Type | Resource | Best For |
| --- | --- | --- |
| Docs | DeepEval Documentation — docs.confident-ai.com | Complete reference for DeepEval metrics. Covers RAG, agent, and LLM evaluation with pytest integration. |
| Docs | Ragas Documentation — docs.ragas.io | RAG-specific metrics framework. Best for faithfulness, context precision/recall evaluation. |
| Docs | LangSmith Documentation — docs.smith.langchain.com | Tracing, datasets, and experiment comparison. Essential for production AI observability. |
| Course | DeepLearning.AI: Building and Evaluating Advanced RAG (Free) | Covers RAG evaluation end-to-end with Ragas. Hands-on notebooks included. |
| Library | Promptfoo — github.com/promptfoo/promptfoo | Open-source prompt testing framework. Red-teaming, regression testing, and CI/CD integration. |

Milestone Project: Full Eval Harness — RAG + Agent on Real Dataset [Advanced, 4–5 days]

Build a complete evaluation harness that runs on every commit — the CI/CD layer for your AI system.
Part A — RAG Evaluation
- Build a 30-question test set from your M18 "Chat With Your Docs" app with ground truth answers
- Run Ragas evaluation: faithfulness, answer_relevancy, context_precision, context_recall
- Run baseline (no reranking) vs enhanced (Cohere reranker from M17) — compare all 4 metrics
- Export results to CSV, identify the 5 worst-performing questions and diagnose why
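A sketch of the Part A comparison and diagnosis steps, assuming both runs were exported with result.to_pandas().to_csv(...) as in the Ragas example above (the file names and the question column are placeholders; Ragas column names vary by version):

```python
import pandas as pd

baseline_df = pd.read_csv("rag_eval_baseline.csv")   # no reranking
reranked_df = pd.read_csv("rag_eval_reranked.csv")   # with Cohere reranker

metrics = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]
comparison = pd.DataFrame({
    "baseline": baseline_df[metrics].mean(),
    "reranked": reranked_df[metrics].mean(),
})
comparison["delta"] = comparison["reranked"] - comparison["baseline"]
print(comparison)

# The 5 worst questions by faithfulness in the enhanced run: diagnose these first
worst = reranked_df.nsmallest(5, "faithfulness")
print(worst[["question", "faithfulness"]])
```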
Part B — Agent Evaluation
- Build a 20-task test set for your M21 hardened agent with expected outcomes and required tools
- Run AgentEvaluator: task_success_rate, avg_turns, avg_cost, forbidden_violations
- Use LLM-as-judge for task success with calibrated faithfulness judge
Part C — CI Integration
- Write a pytest test file using DeepEval assertions
- The test fails if: faithfulness < 0.7 OR task_success_rate < 0.8 OR any forbidden tool used (see the pytest sketch after this list)
- Run locally: `pytest eval_tests.py -v`

Skills: Ragas, DeepEval, LLM-as-judge, AgentEvaluator, pytest integration, regression baselines
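A minimal pytest sketch for Part C, assuming the helpers from this module (evaluate_rag_pipeline, AgentEvaluator, the test sets) live in an eval_harness module, and that my_rag_app / my_agent_app are placeholder entry points for your M18 and M21 projects:

```python
# eval_tests.py (run with: pytest eval_tests.py -v)
import asyncio

from eval_harness import (AgentEvaluator, AGENT_TEST_SET,
                          RAG_TEST_SET, evaluate_rag_pipeline)
from my_agent_app import run_agent   # placeholder: your M21 agent entry point
from my_rag_app import pipeline      # placeholder: your M18 RAG pipeline


def test_rag_faithfulness_regression():
    scores = asyncio.run(evaluate_rag_pipeline(pipeline, RAG_TEST_SET))
    assert scores["faithfulness"] >= 0.7, f"faithfulness regressed: {scores['faithfulness']}"


def test_agent_quality_regression():
    report = AgentEvaluator().evaluate_batch(run_agent, AGENT_TEST_SET)
    assert report["task_success_rate"] >= 0.8, f"task success regressed: {report['task_success_rate']}"
    assert report["forbidden_violations"] == 0, "a forbidden tool was called"
```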
LAB 1: Build and Calibrate an LLM Judge
Objective: Build a faithfulness judge, calibrate it against human ratings, and measure its reliability.
1. Generate 30 RAG outputs (question + context + answer) from your M18 pipeline — 10 clearly faithful, 10 clearly unfaithful (manually inject hallucinations), 10 borderline cases.
2. Rate all 30 yourself (0.0, 0.5, or 1.0). These are your human ratings — your gold set.
3. Run your LLM judge (Claude Haiku) on all 30. Compute pearson_r between human and judge scores using calibrate_judge().
4. If pearson_r < 0.7, iterate on the judge prompt — add clearer scoring criteria, add examples. Re-run until calibrated.
5. Test the 4 known biases: (a) verbosity — does a longer answer get a higher score? (b) position — does ordering change scores in A/B? (c) self-preference — does Haiku prefer Haiku outputs? (d) confidence — does a confident wrong answer score higher?

LAB 2: Ragas End-to-End on Your RAG System
Objective: Run a full Ragas evaluation and use the results to drive a concrete improvement.
1. Create a 20-question dataset for your M18 RAG system. Include: question, ground_truth, contexts (retrieved chunks), answer (your pipeline's output).
2. Run Ragas with all 4 metrics. Print the aggregate scores and the per-question DataFrame.
3. Identify the 3 questions with the lowest faithfulness score. Manually inspect: what did the answer say that wasn't in the context? Is this a retrieval failure or a generation failure?
4. Fix the lowest-performing failure (e.g. rechunk, add a reranker, strengthen the grounding prompt). Re-run Ragas. Document: which metric improved? Did any regress?
5. Document the regression test rule: "Our faithfulness must be ≥ X and context_recall must be ≥ Y on this test set." Write a pytest assertion that enforces this.

LAB 3: Agent Evaluation — Measure Before You Improve
Objective: Establish a baseline for your M21 agent, identify the most common failure pattern, and measure improvement.
1. Write a 15-task test set for your M21 hardened agent covering: 5 simple (1-2 tools), 5 medium (3-4 tools), 5 complex (5+ tools or multi-step reasoning).
2. Run AgentEvaluator.evaluate_batch(). Record: task_success_rate, avg_turns, avg_cost, forbidden_violations.
3. For every failed task (task_success < 0.6), read the judge_reasoning and classify the failure: wrong tool selected, correct tool but wrong args, took too many turns, gave a partial answer.
4. Fix the most common failure category (likely wrong tool selection or a bad tool description). Re-run the evaluation. Show the before/after task_success_rate.

P6-M22 MASTERY CHECKLIST
- Can name and define the 4 RAG metrics: faithfulness, answer relevancy, context precision, context recall
- Can name and define the 4 agent metrics: task success rate, tool call precision, trajectory accuracy, cost per task
- Can implement an LLM-as-judge with Pydantic structured output (score, reasoning, passed)
- Know the 4 LLM judge biases: position bias, verbosity bias, self-preference, sycophancy
- Can calibrate a judge against human ratings using pearson_r — r ≥ 0.7 is acceptable
- Can build a RAG test set with question, ground_truth, expected_source fields
- Can run a Ragas evaluation and interpret the per-metric scores
- Can use Ragas results to identify and fix specific failure cases
- Can implement AgentTestCase with required_tools and forbidden_tools constraints
- Can run AgentEvaluator.evaluate_batch() and report task_success_rate and avg_cost
- Can set up LangSmith tracing with @traceable decorator
- Can create a LangSmith dataset and run an experiment with custom evaluators
- Can use DeepEval metrics in a pytest test that fails on quality regression
- Understand the eval-improve loop: measure baseline → find worst cases → fix → re-measure → repeat
- Completed Lab 1: LLM judge built and calibrated against human ratings
- Completed Lab 2: Ragas evaluation with improvement iteration
- Completed Lab 3: Agent evaluation baseline + fix + re-measure cycle
- Milestone project: full eval harness with pytest CI integration pushed to GitHub
✅ Part 6 Complete! Move to Part 7 — Production & Deployment to learn how to ship everything you've built into a real production environment.
🎉 Part 6 — Agents, Workflows & Evaluation Complete!
You can now build, harden, and measure production-grade AI agent systems.
- Build agents from scratch with ReAct loops
- Design stateful agents with LangGraph
- Implement human-in-the-loop with interrupt/resume
- Design reliable tools with proper error contracts
- Choose the right workflow pattern (chain/route/parallel/agent)
- Detect and contain all 5 agent failure modes
- Implement cost circuit breakers and structured logging
- Evaluate RAG with Ragas and agents with LLM-as-judge