Part 6 — Agents, Workflows & Evaluation  ·  Module 22 of 22
Evaluation Harnesses & Task Success Metrics
Measure what matters — RAG faithfulness, agent task success, and LLM-as-judge patterns
⏱ 1 Week 🟠 Intermediate–Advanced 🔧 DeepEval · Ragas · LangSmith · Promptfoo 📋 Prerequisite: P6-M21
🎯

What This Module Covers

Final Part 6 Module

You cannot improve what you cannot measure. Evaluation is what separates teams that ship reliable AI systems from teams that rely on vibes and hope. This module covers the full evaluation stack — from simple metrics you implement yourself to production-grade harnesses.

  • Key metrics — faithfulness, answer relevancy, context recall, task success rate, tool precision
  • LLM-as-judge — using an LLM to evaluate LLM outputs, calibration, and known biases
  • RAG evaluation — RAGAS framework: faithfulness, answer relevancy, context precision, context recall
  • Agent evaluation — task success rate, tool call efficiency, trajectory accuracy
  • DeepEval & Ragas — production eval frameworks with built-in metrics
  • LangSmith — tracing, datasets, evaluation runs, regression testing
📐

The Metrics That Matter

Know These by Name

RAG METRICS

Faithfulness
0–1
Are all claims in the answer supported by the retrieved context? 1.0 = fully grounded
Answer Relevancy
0–1
Does the answer actually address the question asked? High score = on-topic
Context Precision
0–1
Of the retrieved chunks, what fraction were actually useful? 1.0 = all relevant
Context Recall
0–1
Did retrieval find all the chunks needed to answer? 1.0 = nothing missed
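When you have relevance labels for each retrieved chunk, context precision and recall reduce to simple set arithmetic — a minimal sketch (helper names are illustrative, not from any framework):

```python
def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of all relevant chunks that retrieval actually found."""
    if not relevant_ids:
        return 1.0
    hits = sum(1 for cid in relevant_ids if cid in set(retrieved_ids))
    return hits / len(relevant_ids)

# 4 chunks retrieved, 2 of them relevant, out of 3 relevant chunks in total
print(context_precision(["a", "b", "c", "d"], {"a", "c", "e"}))  # 0.5
print(context_recall(["a", "b", "c", "d"], {"a", "c", "e"}))     # 0.666... (2/3)
```

Frameworks like Ragas estimate these with an LLM when you lack chunk-level labels, but the underlying definitions are exactly this.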

AGENT METRICS

Task Success Rate
0–100%
% of tasks where the agent achieved the stated goal. The headline metric.
Tool Call Precision
0–1
Were all tool calls necessary? Unused/redundant calls lower this.
Trajectory Accuracy
0–1
Did the agent follow an efficient path? Compared to optimal sequence.
Cost per Task
$
Average USD spent per successful task completion.
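Tool call precision and trajectory accuracy can likewise be computed deterministically from a logged trajectory, no judge required — a sketch under assumed log shapes (a list of tool names per run):

```python
def tool_call_precision(calls_made: list[str], calls_needed: set[str]) -> float:
    """Fraction of tool calls that were actually necessary."""
    if not calls_made:
        return 1.0  # no calls made, so none were wasted
    useful = sum(1 for c in calls_made if c in calls_needed)
    return useful / len(calls_made)

def trajectory_accuracy(actual: list[str], optimal: list[str]) -> float:
    """In-order match rate against the optimal tool sequence (extra
    steps are not penalised here — precision covers those)."""
    if not optimal:
        return 1.0
    matched, i = 0, 0
    for step in actual:
        if i < len(optimal) and step == optimal[i]:
            matched += 1
            i += 1
    return matched / len(optimal)

# One unnecessary "email" call out of three lowers precision to 2/3
print(tool_call_precision(["search", "email", "calc"], {"search", "calc"}))  # 0.666...
print(trajectory_accuracy(["search", "email", "calc"], ["search", "calc"]))  # 1.0
```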

GENERAL LLM METRICS

Correctness
0–1
Is the answer factually correct? Requires ground truth.
Coherence
0–1
Is the output well-structured and logically consistent?
Toxicity
0–1
Does output contain harmful content? 0.0 = safe.
🤖

LLM-as-Judge — Using AI to Evaluate AI

Core Technique

When you can't write deterministic evaluation rules (most LLM outputs), use a capable LLM as the judge. The key is calibration — your judge must agree with human raters on a validation set.

import anthropic
from pydantic import BaseModel
import instructor

judge_client = instructor.from_anthropic(anthropic.Anthropic())

class JudgeVerdict(BaseModel):
    score:      float   # 0.0 to 1.0
    reasoning:  str
    passed:     bool    # True if score >= threshold

# ── Faithfulness judge ────────────────────────────────
FAITHFULNESS_JUDGE = """You are an expert evaluator. Determine whether every factual
claim in the ANSWER is directly supported by the CONTEXT.

Score 1.0: All claims are explicitly stated in the context.
Score 0.5: Most claims supported, some extrapolation.
Score 0.0: Major claims not in context — hallucination present.

CONTEXT:
{context}

ANSWER:
{answer}"""

def judge_faithfulness(context: str, answer: str,
                        threshold: float = 0.7) -> JudgeVerdict:
    result = judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",   # strong model for judging
        max_tokens=300,
        temperature=0.0,
        messages=[{"role": "user",
                   "content": FAITHFULNESS_JUDGE.format(context=context, answer=answer)}],
        response_model=JudgeVerdict
    )
    result.passed = result.score >= threshold
    return result

# ── Task success judge ────────────────────────────────
TASK_SUCCESS_JUDGE = """Did the AI agent successfully complete the following task?

ORIGINAL TASK: {task}
AGENT'S OUTPUT: {output}

Score 1.0: Task fully completed — all requirements met.
Score 0.5: Task partially completed — some requirements missing.
Score 0.0: Task failed — output does not address the task."""

def judge_task_success(task: str, output: str) -> JudgeVerdict:
    return judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200, temperature=0.0,
        messages=[{"role": "user",
                   "content": TASK_SUCCESS_JUDGE.format(task=task, output=output)}],
        response_model=JudgeVerdict
    )
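The RAG evaluation loop later in this module also calls a `judge_answer_relevancy` helper. It follows the same pattern as the faithfulness judge, just prompted with the question instead of the context — a sketch, reusing `judge_client` and `JudgeVerdict` from above:

```python
ANSWER_RELEVANCY_JUDGE = """Does the ANSWER directly address the QUESTION?

Score 1.0: Fully on-topic — every part of the answer addresses the question.
Score 0.5: Partially relevant — some digression or missing aspects.
Score 0.0: Off-topic — the answer does not address the question.

QUESTION:
{question}

ANSWER:
{answer}"""

def judge_answer_relevancy(question: str, answer: str,
                           threshold: float = 0.7):
    # Returns a JudgeVerdict, same as judge_faithfulness above
    result = judge_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200, temperature=0.0,
        messages=[{"role": "user",
                   "content": ANSWER_RELEVANCY_JUDGE.format(
                       question=question, answer=answer)}],
        response_model=JudgeVerdict
    )
    result.passed = result.score >= threshold
    return result
```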
⚠️

LLM Judge Biases — Know These

Calibration
  • Position bias — judges prefer the first option shown in A/B comparisons. Always randomise ordering and average results.
  • Verbosity bias — longer answers score higher even if less accurate. Penalise unnecessary length explicitly in your judge prompt.
  • Self-preference bias — Claude tends to prefer Claude outputs, GPT prefers GPT outputs. Use a different model family as judge when evaluating your primary model.
  • Sycophancy — judges rate answers higher if they seem confident. Include "do not be influenced by the confidence of the answer" in your judge prompt.
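The position-bias fix — randomise ordering and average — can be sketched as a swap-and-average wrapper. `judge_ab` here is a hypothetical pairwise judge returning a score in [0, 1] for how strongly it prefers the first option shown:

```python
def debiased_preference(judge_ab, answer_a: str, answer_b: str) -> float:
    """Preference for A over B, averaged over both presentation orders."""
    score_a_first = judge_ab(answer_a, answer_b)        # A shown first
    score_b_first = 1.0 - judge_ab(answer_b, answer_a)  # B shown first, flipped back
    return (score_a_first + score_b_first) / 2

# Toy judge with a pure position bias: +0.2 to whichever option is shown first
def biased_toy_judge(first: str, second: str) -> float:
    return min(1.0, 0.5 + 0.2)   # genuinely indifferent, but favours slot 1

print(debiased_preference(biased_toy_judge, "answer A", "answer B"))  # 0.5 — bias cancelled
```

A position bias that is symmetric across orderings cancels exactly; averaging also halves any asymmetric noise.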
# Calibrate your judge against human ratings
# Step 1: get 50 human-rated examples (your gold set)
# Step 2: run your judge on the same 50
# Step 3: compute correlation (Pearson r or Spearman ρ)
# Step 4: if r < 0.7, iterate on the judge prompt

from scipy.stats import pearsonr, spearmanr

def calibrate_judge(human_scores: list[float], judge_scores: list[float]) -> dict:
    pearson_r, _ = pearsonr(human_scores, judge_scores)
    spearman_r, _ = spearmanr(human_scores, judge_scores)
    agreement    = sum(1 for h, j in zip(human_scores, judge_scores)
                       if abs(h - j) < 0.2) / len(human_scores)
    return {
        "pearson_r":   round(pearson_r, 3),
        "spearman_r":  round(spearman_r, 3),
        "agreement":   round(agreement, 3),
        "calibrated": pearson_r >= 0.7   # r ≥ 0.7 considered acceptable
    }
📊

Evaluating Your RAG Pipeline End-to-End

Systematic
# Build a ground truth dataset for RAG evaluation
# Each test case: question + expected answer + expected source
RAG_TEST_SET = [
    {
        "question": "How does DPDK mempool initialisation work?",
        "ground_truth": "DPDK mempool uses rte_mempool_create() with a fixed pool of memory objects pre-allocated at startup.",
        "expected_source": "dpdk-guide-mempool.pdf",
    },
    # ... 20+ test cases
]

# Run full eval loop
import time

async def evaluate_rag_pipeline(pipeline, test_set: list) -> dict:
    scores = {"faithfulness": [], "relevancy": [], "hit_rate": [],
              "cost_usd": [], "latency_ms": []}

    for case in test_set:
        t_start = time.perf_counter()
        result  = await pipeline.query(case["question"])
        latency = (time.perf_counter() - t_start) * 1000

        # Metric 1: Faithfulness
        faith = judge_faithfulness(
            context=" ".join(s["text"] for s in result["sources"]),
            answer=result["answer"]
        )
        scores["faithfulness"].append(faith.score)

        # Metric 2: Answer relevancy — a judge built like judge_faithfulness,
        # prompted with the question instead of the context
        relevancy = judge_answer_relevancy(case["question"], result["answer"])
        scores["relevancy"].append(relevancy.score)

        # Metric 3: Source hit rate
        expected = case["expected_source"]
        hit = any(expected in s.get("source", "") for s in result["sources"])
        scores["hit_rate"].append(float(hit))

        scores["cost_usd"].append(result.get("cost_usd", 0.0))  # 0 if the pipeline doesn't report cost
        scores["latency_ms"].append(latency)

    def avg(lst): return round(sum(lst) / len(lst), 3) if lst else 0
    return {k: avg(v) for k, v in scores.items()}



🕵

Evaluating Agents — Task Success and Trajectory

Agent Specific
# Agent evaluation is harder than RAG eval because:
# 1. The "right answer" may not be unique
# 2. The path matters, not just the destination
# 3. Tool calls have side effects that are hard to undo

from dataclasses import dataclass

@dataclass
class AgentTestCase:
    task:              str
    expected_outcome:  str                       # what a successful completion looks like
    required_tools:    list[str] | None = None   # tools that MUST be called
    forbidden_tools:   list[str] | None = None   # tools that must NOT be called
    max_turns:         int = 10
    max_cost_usd:      float = 0.50

AGENT_TEST_SET = [
    AgentTestCase(
        task="Find the square root of 1764 and the current time",
        expected_outcome="Answer mentions 42 and current time",
        required_tools=["calculate", "get_current_time"],
        max_turns=5
    ),
    AgentTestCase(
        task="Search for DPDK documentation on hugepages",
        expected_outcome="Returns information about hugepage configuration",
        required_tools=["search_web"],
        forbidden_tools=["send_email"],   # should not email anyone
    ),
]

class AgentEvaluator:
    def evaluate(self, agent_fn, test_case: AgentTestCase) -> dict:
        result = agent_fn(test_case.task)
        tools_called = result.get("tools_called", [])
        output       = result.get("answer", "")
        turns        = result.get("turns_used", 0)
        cost         = result.get("cost_usd", 0)

        # Task success — LLM judge
        success = judge_task_success(test_case.task, output)

        # Required tools coverage
        tool_coverage = 1.0
        if test_case.required_tools:
            called_set  = set(tools_called)
            required    = set(test_case.required_tools)
            tool_coverage = len(called_set & required) / len(required)

        # Forbidden tools check
        forbidden_used = []
        if test_case.forbidden_tools:
            forbidden_used = [t for t in tools_called if t in test_case.forbidden_tools]

        # Efficiency heuristic: fewer turns than budgeted scores higher,
        # clamped to 1.0 (using half the turn budget or less scores 1.0)
        efficiency = min(1.0, (test_case.max_turns - turns) / test_case.max_turns + 0.5)

        return {
            "task_success":   success.score,
            "task_passed":    success.passed,
            "tool_coverage":  tool_coverage,
            "forbidden_used": forbidden_used,
            "turns_used":     turns,
            "cost_usd":       cost,
            "efficiency":     efficiency,
            "judge_reasoning": success.reasoning,
        }

    def evaluate_batch(self, agent_fn, test_set) -> dict:
        results   = [self.evaluate(agent_fn, tc) for tc in test_set]
        successes = [r["task_success"] for r in results]
        return {
            "task_success_rate": sum(r["task_passed"] for r in results) / len(results),
            "avg_success_score": sum(successes) / len(successes),
            "avg_turns":         sum(r["turns_used"] for r in results) / len(results),
            "avg_cost_usd":      sum(r["cost_usd"] for r in results) / len(results),
            "forbidden_violations": sum(1 for r in results if r["forbidden_used"]),
            "n":                 len(results),
        }
🔧

DeepEval — Production Eval Framework

Framework
pip install deepeval

from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric, FaithfulnessMetric,
    ContextualPrecisionMetric, ContextualRecallMetric,
    HallucinationMetric, ToxicityMetric,
)
from deepeval.test_case import LLMTestCase

# Define metrics
metrics = [
    AnswerRelevancyMetric(threshold=0.7, model="gpt-4o"),
    FaithfulnessMetric(threshold=0.7, model="gpt-4o"),
    ContextualPrecisionMetric(threshold=0.7, model="gpt-4o"),
    ContextualRecallMetric(threshold=0.7, model="gpt-4o"),
]

# Create a test case
test_case = LLMTestCase(
    input="How does DPDK mempool work?",
    actual_output="DPDK mempool pre-allocates a fixed pool of memory objects...",
    expected_output="rte_mempool_create() creates a fixed-size pool...",   # optional
    retrieval_context=["The DPDK mempool library provides an API to allocate..."]
)

# Run evaluation
results = evaluate([test_case], metrics)

# Use in pytest for CI/CD regression testing
from deepeval import assert_test
import pytest

@pytest.mark.parametrize("test_case", my_test_cases)   # my_test_cases: your list[LLMTestCase]
def test_rag_quality(test_case):
    assert_test(test_case, metrics)
📈

Ragas — RAG Assessment Framework

RAG Specific
pip install ragas

from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall
)
from datasets import Dataset

# Prepare your evaluation dataset
data = {
    "question": ["How does DPDK mempool work?", "What is VPP?"],
    "answer":   ["DPDK mempool uses rte_mempool_create...", "VPP is Vector Packet Processor..."],
    "contexts": [
        ["The mempool library provides...", "rte_mempool_create allocates..."],
        ["VPP is FD.io's data plane..."],
    ],
    "ground_truth": ["rte_mempool_create creates a fixed pool", "VPP processes vectors of packets"]
}
dataset = Dataset.from_dict(data)

# Run Ragas evaluation
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(result)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.84, 'context_recall': 0.79}

# Convert to pandas for analysis
df = result.to_pandas()
df.to_csv("rag_eval_results.csv", index=False)
# Identify lowest-scoring questions → improve retrieval or chunking for those
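The last step — finding the lowest-scoring questions in the per-question DataFrame — is one call; the toy data here stands in for `result.to_pandas()`:

```python
import pandas as pd

# Toy per-question rows standing in for result.to_pandas()
df = pd.DataFrame({
    "question":     ["q1", "q2", "q3", "q4"],
    "faithfulness": [0.95, 0.40, 0.88, 0.55],
})

# The questions to diagnose first are simply the lowest scorers
worst = df.nsmallest(2, "faithfulness")
print(worst["question"].tolist())  # ['q2', 'q4']
```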
🔭

LangSmith — Tracing and Evaluation Platform

Observability
pip install langsmith

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"]    = os.environ["LANGSMITH_API_KEY"]
os.environ["LANGCHAIN_PROJECT"]    = "my-rag-project"

# All LangChain calls now auto-trace to LangSmith
# Go to smith.langchain.com → see every run

# ── Manual tracing (without LangChain) ───────────────
from langsmith import Client, traceable

ls_client = Client()

@traceable(name="rag_query", run_type="chain")
def traced_rag_query(question: str) -> dict:
    # Your RAG pipeline here — every call is auto-logged
    result = rag_query(question)
    return result

# ── Dataset-based evaluation ──────────────────────────
# Create a dataset in LangSmith
dataset = ls_client.create_dataset("rag-eval-set")
ls_client.create_examples(
    inputs=[{"question": t["question"]} for t in RAG_TEST_SET],
    outputs=[{"answer": t["ground_truth"]} for t in RAG_TEST_SET],
    dataset_id=dataset.id
)

# Define evaluator function
def faithfulness_evaluator(run, example) -> dict:
    verdict = judge_faithfulness(
        context=run.outputs.get("context", ""),
        answer=run.outputs.get("answer", "")
    )
    return {"key": "faithfulness", "score": verdict.score,
            "comment": verdict.reasoning}

# Run evaluation against the dataset
from langsmith.evaluation import evaluate as ls_evaluate

results = ls_evaluate(
    traced_rag_query,
    data="rag-eval-set",
    evaluators=[faithfulness_evaluator],
    experiment_prefix="v2-reranked"
)
# Results visible in LangSmith UI with charts, per-example scores, diffs

💡 LangSmith's experiment comparison is its killer feature. Run your baseline (v1) and improved (v2) pipelines against the same dataset, and LangSmith shows a side-by-side diff of every metric. This is how you prove that a new reranker or chunking strategy improved quality without regressions.

FREE LEARNING RESOURCES

  • Docs — DeepEval Documentation (docs.confident-ai.com): Complete reference for DeepEval metrics. Covers RAG, agent, and LLM evaluation with pytest integration.
  • Docs — Ragas Documentation (docs.ragas.io): RAG-specific metrics framework. Best for faithfulness, context precision/recall evaluation.
  • Docs — LangSmith Documentation (docs.smith.langchain.com): Tracing, datasets, and experiment comparison. Essential for production AI observability.
  • Course — DeepLearning.AI: Building and Evaluating Advanced RAG (free): Covers RAG evaluation end-to-end with Ragas. Hands-on notebooks included.
  • Library — Promptfoo (github.com/promptfoo/promptfoo): Open-source prompt testing framework. Red-teaming, regression testing, and CI/CD integration.
🛠 Full Eval Harness — RAG + Agent on Real Dataset [Advanced] 4–5 days

Build a complete evaluation harness that runs on every commit — the CI/CD layer for your AI system.

Part A — RAG Evaluation

  • Build a 30-question test set from your M18 "Chat With Your Docs" app with ground truth answers
  • Run Ragas evaluation: faithfulness, answer_relevancy, context_precision, context_recall
  • Run baseline (no reranking) vs enhanced (Cohere reranker from M17) — compare all 4 metrics
  • Export results to CSV, identify the 5 worst-performing questions and diagnose why

Part B — Agent Evaluation

  • Build a 20-task test set for your M21 hardened agent with expected outcomes and required tools
  • Run AgentEvaluator: task_success_rate, avg_turns, avg_cost, forbidden_violations
  • Use LLM-as-judge for task success, calibrated the same way as the faithfulness judge

Part C — CI Integration

  • Write a pytest test file using DeepEval assertions
  • The test fails if: faithfulness < 0.7 OR task_success_rate < 0.8 OR any forbidden tool used
  • Run locally: pytest eval_tests.py -v

Skills: Ragas, DeepEval, LLM-as-judge, AgentEvaluator, pytest integration, regression baselines

LAB 1

Build and Calibrate an LLM Judge

Objective: Build a faithfulness judge, calibrate it against human ratings, and measure its reliability.

1
Generate 30 RAG outputs (question + context + answer) from your M18 pipeline — 10 clearly faithful, 10 clearly unfaithful (manually inject hallucinations), 10 borderline cases.
2
Rate all 30 yourself (0.0, 0.5, or 1.0). These are your human ratings — your gold set.
3
Run your LLM judge (Claude Haiku) on all 30. Compute pearson_r between human and judge scores using calibrate_judge().
4
If pearson_r < 0.7, iterate on the judge prompt — add clearer scoring criteria, add examples. Re-run until calibrated.
5
Test the 4 known biases: (a) verbosity — does a longer answer get higher score? (b) position — does ordering change scores in A/B? (c) self-preference — does Haiku prefer Haiku outputs? (d) confidence — does a confident wrong answer score higher?
LAB 2

Ragas End-to-End on Your RAG System

Objective: Run a full Ragas evaluation and use the results to drive a concrete improvement.

1
Create a 20-question dataset for your M18 RAG system. Include: question, ground_truth, contexts (retrieved chunks), answer (your pipeline's output).
2
Run Ragas with all 4 metrics. Print the aggregate scores and the per-question DataFrame.
3
Identify the 3 questions with the lowest faithfulness score. Manually inspect: what did the answer say that wasn't in the context? Is this a retrieval failure or generation failure?
4
Fix the lowest-performing failure (e.g. rechunk, add reranker, strengthen grounding prompt). Re-run Ragas. Document: which metric improved? Did any regress?
5
Document the regression test rule: "Our faithfulness must be ≥ X and context_recall must be ≥ Y on this test set." Write a pytest assertion that enforces this.
LAB 3

Agent Evaluation — Measure Before You Improve

Objective: Establish a baseline for your M21 agent, identify the most common failure pattern, and measure improvement.

1
Write a 15-task test set for your M21 hardened agent covering: 5 simple (1-2 tools), 5 medium (3-4 tools), 5 complex (5+ tools or multi-step reasoning).
2
Run AgentEvaluator.evaluate_batch(). Record: task_success_rate, avg_turns, avg_cost, forbidden_violations.
3
For every failed task (task_success < 0.6), read the judge_reasoning and classify the failure: wrong tool selected, correct tool but wrong args, took too many turns, gave partial answer.
4
Fix the most common failure category (likely wrong tool selection or bad tool description). Re-run the evaluation. Show the before/after task_success_rate.

P6-M22 MASTERY CHECKLIST

  • Can name and define the 4 RAG metrics: faithfulness, answer relevancy, context precision, context recall
  • Can name and define the 4 agent metrics: task success rate, tool call precision, trajectory accuracy, cost per task
  • Can implement an LLM-as-judge with Pydantic structured output (score, reasoning, passed)
  • Know the 4 LLM judge biases: position bias, verbosity bias, self-preference, sycophancy
  • Can calibrate a judge against human ratings using pearson_r — r ≥ 0.7 is acceptable
  • Can build a RAG test set with question, ground_truth, expected_source fields
  • Can run a Ragas evaluation and interpret the per-metric scores
  • Can use Ragas results to identify and fix specific failure cases
  • Can implement AgentTestCase with required_tools and forbidden_tools constraints
  • Can run AgentEvaluator.evaluate_batch() and report task_success_rate and avg_cost
  • Can set up LangSmith tracing with @traceable decorator
  • Can create a LangSmith dataset and run an experiment with custom evaluators
  • Can use DeepEval metrics in a pytest test that fails on quality regression
  • Understand the eval-improve loop: measure baseline → find worst cases → fix → re-measure → repeat
  • Completed Lab 1: LLM judge built and calibrated against human ratings
  • Completed Lab 2: Ragas evaluation with improvement iteration
  • Completed Lab 3: Agent evaluation baseline + fix + re-measure cycle
  • Milestone project: full eval harness with pytest CI integration pushed to GitHub

Part 6 Complete! Move to Part 7 — Production & Deployment to learn how to ship everything you've built into a real production environment.

🎉 Part 6 — Agents, Workflows & Evaluation Complete!

You can now build, harden, and measure production-grade AI agent systems.

Build agents from scratch with ReAct loops
Design stateful agents with LangGraph
Implement human-in-the-loop with interrupt/resume
Design reliable tools with proper error contracts
Choose the right workflow pattern (chain/route/parallel/agent)
Detect and contain all 5 agent failure modes
Implement cost circuit breakers and structured logging
Evaluate RAG with Ragas and agents with LLM-as-judge