Part 6 — Agents, Workflows & Evaluation  ·  Module 22 of 22
Evaluation Harnesses & Task Success Metrics
Measure what matters — RAG faithfulness, agent task success, and LLM-as-judge patterns
⏱ 1 Week 🟠 Intermediate–Advanced 🔧 DeepEval · Ragas · LangSmith · Promptfoo 📋 Prerequisite: P6-M21
🎯

What This Module Covers

Final Part 6 Module

You cannot improve what you cannot measure. Evaluation is what separates teams that ship reliable AI systems from teams that rely on vibes and hope. This module covers the full evaluation stack — from simple metrics you implement yourself to production-grade harnesses.

  • Key metrics — faithfulness, answer relevancy, context recall, task success rate, tool precision
  • LLM-as-judge — using an LLM to evaluate LLM outputs, calibration, and known biases
  • RAG evaluation — RAGAS framework: faithfulness, answer relevancy, context precision, context recall
  • Agent evaluation — task success rate, tool call efficiency, trajectory accuracy
  • DeepEval & Ragas — production eval frameworks with built-in metrics
  • LangSmith — tracing, datasets, evaluation runs, regression testing
📐

The Metrics That Matter

Know These by Name

RAG METRICS

Faithfulness
0–1
Are all claims in the answer supported by the retrieved context? 1.0 = fully grounded
Answer Relevancy
0–1
Does the answer actually address the question asked? High score = on-topic
Context Precision
0–1
Of the retrieved chunks, what fraction were actually useful? 1.0 = all relevant
Context Recall
0–1
Did retrieval find all the chunks needed to answer? 1.0 = nothing missed
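When you have relevance labels for each retrieved chunk, context precision and recall reduce to simple set arithmetic — a minimal sketch (helper names are illustrative, not from any framework):

```python
def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of all relevant chunks that retrieval actually found."""
    if not relevant_ids:
        return 1.0
    hits = sum(1 for cid in relevant_ids if cid in set(retrieved_ids))
    return hits / len(relevant_ids)

# 4 chunks retrieved, 2 of them relevant, out of 3 relevant chunks in total
print(context_precision(["a", "b", "c", "d"], {"a", "c", "e"}))  # 0.5
print(context_recall(["a", "b", "c", "d"], {"a", "c", "e"}))     # 0.666... (2/3)
```

Frameworks like Ragas estimate these with an LLM when you lack chunk-level labels, but the underlying definitions are exactly this.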

AGENT METRICS

Task Success Rate
0–100%
% of tasks where the agent achieved the stated goal. The headline metric.
Tool Call Precision
0–1
Were all tool calls necessary? Unused/redundant calls lower this.
Trajectory Accuracy
0–1
Did the agent follow an efficient path? Compared to optimal sequence.
Cost per Task
$
Average USD spent per successful task completion.
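Tool call precision and trajectory accuracy can likewise be computed deterministically from a logged trajectory, no judge required — a sketch under assumed log shapes (a list of tool names per run):

```python
def tool_call_precision(calls_made: list[str], calls_needed: set[str]) -> float:
    """Fraction of tool calls that were actually necessary."""
    if not calls_made:
        return 1.0  # no calls made, so none were wasted
    useful = sum(1 for c in calls_made if c in calls_needed)
    return useful / len(calls_made)

def trajectory_accuracy(actual: list[str], optimal: list[str]) -> float:
    """In-order match rate against the optimal tool sequence (extra
    steps are not penalised here — precision covers those)."""
    if not optimal:
        return 1.0
    matched, i = 0, 0
    for step in actual:
        if i < len(optimal) and step == optimal[i]:
            matched += 1
            i += 1
    return matched / len(optimal)

# One unnecessary "email" call out of three lowers precision to 2/3
print(tool_call_precision(["search", "email", "calc"], {"search", "calc"}))  # 0.666...
print(trajectory_accuracy(["search", "email", "calc"], ["search", "calc"]))  # 1.0
```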

GENERAL LLM METRICS

Correctness
0–1
Is the answer factually correct? Requires ground truth.
Coherence
0–1
Is the output well-structured and logically consistent?
Toxicity
0–1
Does output contain harmful content? 0.0 = safe.
🤖

LLM-as-Judge — Using AI to Evaluate AI

Core Technique

When you can't write deterministic evaluation rules (most LLM outputs), use a capable LLM as the judge. The key is calibration — your judge must agree with human raters on a validation set.

import anthropic
from pydantic import BaseModel
import instructor

judge_client = instructor.from_anthropic(anthropic.Anthropic())

class JudgeVerdict(BaseModel):
    score:      float   # 0.0 to 1.0
    reasoning:  str
    passed:     bool    # True if score >= threshold

# ── Faithfulness judge ────────────────────────────────
FAITHFULNESS_JUDGE = """You are an expert evaluator. Determine whether every factual
claim in the ANSWER is directly supported by the CONTEXT.

Score 1.0: All claims are explicitly stated in the context.
Score 0.5: Most claims supported, some extrapolation.
Score 0.0: Major claims not in context — hallucination present.

CONTEXT:
{context}

ANSWER:
{answer}"""

def judge_faithfulness(context: str, answer: str,
                        threshold: float = 0.7) -> JudgeVerdict:
    result = judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",   # strong model for judging
        max_tokens=300,
        temperature=0.0,
        messages=[{"role": "user",
                   "content": FAITHFULNESS_JUDGE.format(context=context, answer=answer)}],
        response_model=JudgeVerdict
    )
    result.passed = result.score >= threshold
    return result

# ── Task success judge ────────────────────────────────
TASK_SUCCESS_JUDGE = """Did the AI agent successfully complete the following task?

ORIGINAL TASK: {task}
AGENT'S OUTPUT: {output}

Score 1.0: Task fully completed — all requirements met.
Score 0.5: Task partially completed — some requirements missing.
Score 0.0: Task failed — output does not address the task."""

def judge_task_success(task: str, output: str) -> JudgeVerdict:
    return judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200, temperature=0.0,
        messages=[{"role": "user",
                   "content": TASK_SUCCESS_JUDGE.format(task=task, output=output)}],
        response_model=JudgeVerdict
    )
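The RAG evaluation loop later in this module also calls a `judge_answer_relevancy` helper. It follows the same pattern as the faithfulness judge, just prompted with the question instead of the context — a sketch, reusing `judge_client` and `JudgeVerdict` from above:

```python
ANSWER_RELEVANCY_JUDGE = """Does the ANSWER directly address the QUESTION?

Score 1.0: Fully on-topic — every part of the answer addresses the question.
Score 0.5: Partially relevant — some digression or missing aspects.
Score 0.0: Off-topic — the answer does not address the question.

QUESTION:
{question}

ANSWER:
{answer}"""

def judge_answer_relevancy(question: str, answer: str,
                           threshold: float = 0.7):
    # Returns a JudgeVerdict, same as judge_faithfulness above
    result = judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200, temperature=0.0,
        messages=[{"role": "user",
                   "content": ANSWER_RELEVANCY_JUDGE.format(
                       question=question, answer=answer)}],
        response_model=JudgeVerdict
    )
    result.passed = result.score >= threshold
    return result
```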
⚠️

LLM Judge Biases — Know These

Calibration
  • Position bias — judges prefer the first option shown in A/B comparisons. Always randomise ordering and average results.
  • Verbosity bias — longer answers score higher even if less accurate. Penalise unnecessary length explicitly in your judge prompt.
  • Self-preference bias — Claude tends to prefer Claude outputs, GPT prefers GPT outputs. Use a different model family as judge when evaluating your primary model.
  • Sycophancy — judges rate answers higher if they seem confident. Include "do not be influenced by the confidence of the answer" in your judge prompt.
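The position-bias fix — randomise ordering and average — can be sketched as a swap-and-average wrapper. `judge_ab` here is a hypothetical pairwise judge returning a score in [0, 1] for how strongly it prefers the first option shown:

```python
def debiased_preference(judge_ab, answer_a: str, answer_b: str) -> float:
    """Preference for A over B, averaged over both presentation orders."""
    score_a_first = judge_ab(answer_a, answer_b)        # A shown first
    score_b_first = 1.0 - judge_ab(answer_b, answer_a)  # B shown first, flipped back
    return (score_a_first + score_b_first) / 2

# Toy judge with a pure position bias: +0.2 to whichever option is shown first
def biased_toy_judge(first: str, second: str) -> float:
    return min(1.0, 0.5 + 0.2)   # genuinely indifferent, but favours slot 1

print(debiased_preference(biased_toy_judge, "answer A", "answer B"))  # 0.5 — bias cancelled
```

A position bias that is symmetric across orderings cancels exactly; averaging also halves any asymmetric noise.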
# Calibrate your judge against human ratings
# Step 1: get 50 human-rated examples (your gold set)
# Step 2: run your judge on the same 50
# Step 3: compute correlation (Pearson r or Spearman ρ)
# Step 4: if r < 0.7, iterate on the judge prompt

from scipy.stats import pearsonr, spearmanr

def calibrate_judge(human_scores: list[float], judge_scores: list[float]) -> dict:
    pearson_r, _ = pearsonr(human_scores, judge_scores)
    spearman_r, _ = spearmanr(human_scores, judge_scores)
    agreement    = sum(1 for h, j in zip(human_scores, judge_scores)
                       if abs(h - j) < 0.2) / len(human_scores)
    return {
        "pearson_r":   round(pearson_r, 3),
        "spearman_r":  round(spearman_r, 3),
        "agreement":   round(agreement, 3),
        "calibrated": pearson_r >= 0.7   # r ≥ 0.7 considered acceptable
    }
📊

Evaluating Your RAG Pipeline End-to-End

Systematic
# Build a ground truth dataset for RAG evaluation
# Each test case: question + expected answer + expected source
RAG_TEST_SET = [
    {
        "question": "How does DPDK mempool initialisation work?",
        "ground_truth": "DPDK mempool uses rte_mempool_create() with a fixed pool of memory objects pre-allocated at startup.",
        "expected_source": "dpdk-guide-mempool.pdf",
    },
    # ... 20+ test cases
]

# Run full eval loop
import time

async def evaluate_rag_pipeline(pipeline, test_set: list) -> dict:
    scores = {"faithfulness": [], "relevancy": [], "hit_rate": [],
              "cost_usd": [], "latency_ms": []}

    for case in test_set:
        t_start = time.perf_counter()
        result  = await pipeline.query(case["question"])
        latency = (time.perf_counter() - t_start) * 1000

        # Metric 1: Faithfulness
        faith = judge_faithfulness(
            context=" ".join(s["text"] for s in result["sources"]),
            answer=result["answer"]
        )
        scores["faithfulness"].append(faith.score)

        # Metric 2: Answer relevancy — a judge built like judge_faithfulness,
        # prompted with the question instead of the context
        relevancy = judge_answer_relevancy(case["question"], result["answer"])
        scores["relevancy"].append(relevancy.score)

        # Metric 3: Source hit rate
        expected = case["expected_source"]
        hit = any(expected in s.get("source", "") for s in result["sources"])
        scores["hit_rate"].append(float(hit))

        scores["cost_usd"].append(result.get("cost_usd", 0.0))  # 0 if the pipeline doesn't report cost
        scores["latency_ms"].append(latency)

    def avg(lst): return round(sum(lst) / len(lst), 3) if lst else 0
    return {k: avg(v) for k, v in scores.items()}



🕵

Evaluating Agents — Task Success and Trajectory

Agent Specific
# Agent evaluation is harder than RAG eval because:
# 1. The "right answer" may not be unique
# 2. The path matters, not just the destination
# 3. Tool calls have side effects that are hard to undo

from dataclasses import dataclass

@dataclass
class AgentTestCase:
    task:              str
    expected_outcome:  str                       # what a successful completion looks like
    required_tools:    list[str] | None = None   # tools that MUST be called
    forbidden_tools:   list[str] | None = None   # tools that must NOT be called
    max_turns:         int = 10
    max_cost_usd:      float = 0.50

AGENT_TEST_SET = [
    AgentTestCase(
        task="Find the square root of 1764 and the current time",
        expected_outcome="Answer mentions 42 and current time",
        required_tools=["calculate", "get_current_time"],
        max_turns=5
    ),
    AgentTestCase(
        task="Search for DPDK documentation on hugepages",
        expected_outcome="Returns information about hugepage configuration",
        required_tools=["search_web"],
        forbidden_tools=["send_email"],   # should not email anyone
    ),
]

class AgentEvaluator:
    def evaluate(self, agent_fn, test_case: AgentTestCase) -> dict:
        result = agent_fn(test_case.task)
        tools_called = result.get("tools_called", [])
        output       = result.get("answer", "")
        turns        = result.get("turns_used", 0)
        cost         = result.get("cost_usd", 0)

        # Task success — LLM judge
        success = judge_task_success(test_case.task, output)

        # Required tools coverage
        tool_coverage = 1.0
        if test_case.required_tools:
            called_set  = set(tools_called)
            required    = set(test_case.required_tools)
            tool_coverage = len(called_set & required) / len(required)

        # Forbidden tools check
        forbidden_used = []
        if test_case.forbidden_tools:
            forbidden_used = [t for t in tools_called if t in test_case.forbidden_tools]

        # Efficiency heuristic: fewer turns than budgeted scores higher,
        # clamped to 1.0 (using half the turn budget or less scores 1.0)
        efficiency = min(1.0, (test_case.max_turns - turns) / test_case.max_turns + 0.5)

        return {
            "task_success":   success.score,
            "task_passed":    success.passed,
            "tool_coverage":  tool_coverage,
            "forbidden_used": forbidden_used,
            "turns_used":     turns,
            "cost_usd":       cost,
            "efficiency":     efficiency,
            "judge_reasoning": success.reasoning,
        }

    def evaluate_batch(self, agent_fn, test_set) -> dict:
        results   = [self.evaluate(agent_fn, tc) for tc in test_set]
        successes = [r["task_success"] for r in results]
        return {
            "task_success_rate": sum(r["task_passed"] for r in results) / len(results),
            "avg_success_score": sum(successes) / len(successes),
            "avg_turns":         sum(r["turns_used"] for r in results) / len(results),
            "avg_cost_usd":      sum(r["cost_usd"] for r in results) / len(results),
            "forbidden_violations": sum(1 for r in results if r["forbidden_used"]),
            "n":                 len(results),
        }
🔧

DeepEval — Production Eval Framework

Framework
pip install deepeval

from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric, FaithfulnessMetric,
    ContextualPrecisionMetric, ContextualRecallMetric,
    HallucinationMetric, ToxicityMetric,
)
from deepeval.test_case import LLMTestCase

# Define metrics
metrics = [
    AnswerRelevancyMetric(threshold=0.7, model="gpt-4o"),
    FaithfulnessMetric(threshold=0.7, model="gpt-4o"),
    ContextualPrecisionMetric(threshold=0.7, model="gpt-4o"),
    ContextualRecallMetric(threshold=0.7, model="gpt-4o"),
]

# Create a test case
test_case = LLMTestCase(
    input="How does DPDK mempool work?",
    actual_output="DPDK mempool pre-allocates a fixed pool of memory objects...",
    expected_output="rte_mempool_create() creates a fixed-size pool...",   # optional
    retrieval_context=["The DPDK mempool library provides an API to allocate..."]
)

# Run evaluation
results = evaluate([test_case], metrics)

# Use in pytest for CI/CD regression testing
from deepeval import assert_test
import pytest

@pytest.mark.parametrize("test_case", my_test_cases)   # my_test_cases: your list[LLMTestCase]
def test_rag_quality(test_case):
    assert_test(test_case, metrics)
📈

Ragas — RAG Assessment Framework

RAG Specific
pip install ragas

from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall
)
from datasets import Dataset

# Prepare your evaluation dataset
data = {
    "question": ["How does DPDK mempool work?", "What is VPP?"],
    "answer":   ["DPDK mempool uses rte_mempool_create...", "VPP is Vector Packet Processor..."],
    "contexts": [
        ["The mempool library provides...", "rte_mempool_create allocates..."],
        ["VPP is FD.io's data plane..."],
    ],
    "ground_truth": ["rte_mempool_create creates a fixed pool", "VPP processes vectors of packets"]
}
dataset = Dataset.from_dict(data)

# Run Ragas evaluation
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(result)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.84, 'context_recall': 0.79}

# Convert to pandas for analysis
df = result.to_pandas()
df.to_csv("rag_eval_results.csv", index=False)
# Identify lowest-scoring questions → improve retrieval or chunking for those
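The last step — finding the lowest-scoring questions in the per-question DataFrame — is one call; the toy data here stands in for `result.to_pandas()`:

```python
import pandas as pd

# Toy per-question rows standing in for result.to_pandas()
df = pd.DataFrame({
    "question":     ["q1", "q2", "q3", "q4"],
    "faithfulness": [0.95, 0.40, 0.88, 0.55],
})

# The questions to diagnose first are simply the lowest scorers
worst = df.nsmallest(2, "faithfulness")
print(worst["question"].tolist())  # ['q2', 'q4']
```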
🔭

LangSmith — Tracing and Evaluation Platform

Observability
pip install langsmith

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"]    = os.environ["LANGSMITH_API_KEY"]
os.environ["LANGCHAIN_PROJECT"]    = "my-rag-project"

# All LangChain calls now auto-trace to LangSmith
# Go to smith.langchain.com → see every run

# ── Manual tracing (without LangChain) ───────────────
from langsmith import Client, traceable

ls_client = Client()

@traceable(name="rag_query", run_type="chain")
def traced_rag_query(question: str) -> dict:
    # Your RAG pipeline here — every call is auto-logged
    result = rag_query(question)
    return result

# ── Dataset-based evaluation ──────────────────────────
# Create a dataset in LangSmith
dataset = ls_client.create_dataset("rag-eval-set")
ls_client.create_examples(
    inputs=[{"question": t["question"]} for t in RAG_TEST_SET],
    outputs=[{"answer": t["ground_truth"]} for t in RAG_TEST_SET],
    dataset_id=dataset.id
)

# Define evaluator function
def faithfulness_evaluator(run, example) -> dict:
    verdict = judge_faithfulness(
        context=run.outputs.get("context", ""),
        answer=run.outputs.get("answer", "")
    )
    return {"key": "faithfulness", "score": verdict.score,
            "comment": verdict.reasoning}

# Run evaluation against the dataset
from langsmith.evaluation import evaluate as ls_evaluate

results = ls_evaluate(
    traced_rag_query,
    data="rag-eval-set",
    evaluators=[faithfulness_evaluator],
    experiment_prefix="v2-reranked"
)
# Results visible in LangSmith UI with charts, per-example scores, diffs

💡 LangSmith's experiment comparison is its killer feature. Run your baseline (v1) and improved (v2) pipelines against the same dataset, and LangSmith shows a side-by-side diff of every metric. This is how you prove that a new reranker or chunking strategy improved quality without regressions.

FREE LEARNING RESOURCES

  • Docs — DeepEval Documentation (docs.confident-ai.com): Complete reference for DeepEval metrics. Covers RAG, agent, and LLM evaluation with pytest integration.
  • Docs — Ragas Documentation (docs.ragas.io): RAG-specific metrics framework. Best for faithfulness, context precision/recall evaluation.
  • Docs — LangSmith Documentation (docs.smith.langchain.com): Tracing, datasets, and experiment comparison. Essential for production AI observability.
  • Course — DeepLearning.AI: Building and Evaluating Advanced RAG (free): Covers RAG evaluation end-to-end with Ragas. Hands-on notebooks included.
  • Library — Promptfoo (github.com/promptfoo/promptfoo): Open-source prompt testing framework. Red-teaming, regression testing, and CI/CD integration.
🛠 Full Eval Harness — RAG + Agent on Real Dataset [Advanced] 4–5 days

Build a complete evaluation harness that runs on every commit — the CI/CD layer for your AI system.

Part A — RAG Evaluation

  • Build a 30-question test set from your M18 "Chat With Your Docs" app with ground truth answers
  • Run Ragas evaluation: faithfulness, answer_relevancy, context_precision, context_recall
  • Run baseline (no reranking) vs enhanced (Cohere reranker from M17) — compare all 4 metrics
  • Export results to CSV, identify the 5 worst-performing questions and diagnose why

Part B — Agent Evaluation

  • Build a 20-task test set for your M21 hardened agent with expected outcomes and required tools
  • Run AgentEvaluator: task_success_rate, avg_turns, avg_cost, forbidden_violations
  • Use LLM-as-judge for task success, calibrated the same way as the faithfulness judge

Part C — CI Integration

  • Write a pytest test file using DeepEval assertions
  • The test fails if: faithfulness < 0.7 OR task_success_rate < 0.8 OR any forbidden tool used
  • Run locally: pytest eval_tests.py -v

Skills: Ragas, DeepEval, LLM-as-judge, AgentEvaluator, pytest integration, regression baselines

LAB 1

Build and Calibrate an LLM Judge

Objective: Build a faithfulness judge, calibrate it against human ratings, and measure its reliability.

1
Generate 30 RAG outputs (question + context + answer) from your M18 pipeline — 10 clearly faithful, 10 clearly unfaithful (manually inject hallucinations), 10 borderline cases.
2
Rate all 30 yourself (0.0, 0.5, or 1.0). These are your human ratings — your gold set.
3
Run your LLM judge (Claude Haiku) on all 30. Compute pearson_r between human and judge scores using calibrate_judge().
4
If pearson_r < 0.7, iterate on the judge prompt — add clearer scoring criteria, add examples. Re-run until calibrated.
5
Test the 4 known biases: (a) verbosity — does a longer answer get higher score? (b) position — does ordering change scores in A/B? (c) self-preference — does Haiku prefer Haiku outputs? (d) confidence — does a confident wrong answer score higher?
LAB 2

Ragas End-to-End on Your RAG System

Objective: Run a full Ragas evaluation and use the results to drive a concrete improvement.

1
Create a 20-question dataset for your M18 RAG system. Include: question, ground_truth, contexts (retrieved chunks), answer (your pipeline's output).
2
Run Ragas with all 4 metrics. Print the aggregate scores and the per-question DataFrame.
3
Identify the 3 questions with the lowest faithfulness score. Manually inspect: what did the answer say that wasn't in the context? Is this a retrieval failure or generation failure?
4
Fix the lowest-performing failure (e.g. rechunk, add reranker, strengthen grounding prompt). Re-run Ragas. Document: which metric improved? Did any regress?
5
Document the regression test rule: "Our faithfulness must be ≥ X and context_recall must be ≥ Y on this test set." Write a pytest assertion that enforces this.
LAB 3

Agent Evaluation — Measure Before You Improve

Objective: Establish a baseline for your M21 agent, identify the most common failure pattern, and measure improvement.

1
Write a 15-task test set for your M21 hardened agent covering: 5 simple (1-2 tools), 5 medium (3-4 tools), 5 complex (5+ tools or multi-step reasoning).
2
Run AgentEvaluator.evaluate_batch(). Record: task_success_rate, avg_turns, avg_cost, forbidden_violations.
3
For every failed task (task_success < 0.6), read the judge_reasoning and classify the failure: wrong tool selected, correct tool but wrong args, took too many turns, gave partial answer.
4
Fix the most common failure category (likely wrong tool selection or bad tool description). Re-run the evaluation. Show the before/after task_success_rate.

P6-M22 MASTERY CHECKLIST

  • Can name and define the 4 RAG metrics: faithfulness, answer relevancy, context precision, context recall
  • Can name and define the 4 agent metrics: task success rate, tool call precision, trajectory accuracy, cost per task
  • Can implement an LLM-as-judge with Pydantic structured output (score, reasoning, passed)
  • Know the 4 LLM judge biases: position bias, verbosity bias, self-preference, sycophancy
  • Can calibrate a judge against human ratings using pearson_r — r ≥ 0.7 is acceptable
  • Can build a RAG test set with question, ground_truth, expected_source fields
  • Can run a Ragas evaluation and interpret the per-metric scores
  • Can use Ragas results to identify and fix specific failure cases
  • Can implement AgentTestCase with required_tools and forbidden_tools constraints
  • Can run AgentEvaluator.evaluate_batch() and report task_success_rate and avg_cost
  • Can set up LangSmith tracing with @traceable decorator
  • Can create a LangSmith dataset and run an experiment with custom evaluators
  • Can use DeepEval metrics in a pytest test that fails on quality regression
  • Understand the eval-improve loop: measure baseline → find worst cases → fix → re-measure → repeat
  • Completed Lab 1: LLM judge built and calibrated against human ratings
  • Completed Lab 2: Ragas evaluation with improvement iteration
  • Completed Lab 3: Agent evaluation baseline + fix + re-measure cycle
  • Milestone project: full eval harness with pytest CI integration pushed to GitHub

Part 6 Complete! Move to Part 7 — Production & Deployment to learn how to ship everything you've built into a real production environment.

🎉 Part 6 — Agents, Workflows & Evaluation Complete!

You can now build, harden, and measure production-grade AI agent systems.

Build agents from scratch with ReAct loops
Design stateful agents with LangGraph
Implement human-in-the-loop with interrupt/resume
Design reliable tools with proper error contracts
Choose the right workflow pattern (chain/route/parallel/agent)
Detect and contain all 5 agent failure modes
Implement cost circuit breakers and structured logging
Evaluate RAG with Ragas and agents with LLM-as-judge