What This Module Covers
Production Hardening

Agents fail in ways chains never do. They can loop forever, call tools that don't exist, spend your entire monthly API budget in 10 minutes, or get stuck, unable to make progress. This module gives you the tools to detect, contain, and recover from every major agent failure mode.
- Failure taxonomy — the 5 agent failure modes and how to recognise each
- Loop detection — detecting infinite loops, repeated tool calls, lack of progress
- Guardrails — output validation, tool call validation, scope enforcement
- Cost circuit breakers — hard spending limits that stop runaway agents
- Structured agent logging — capturing every decision for debugging and audit
- Recovery patterns — graceful degradation, fallback to human, partial result return
The 5 Agent Failure Modes
Know These

Infinite Loop
Agent calls the same tool repeatedly with same args, never making progress. max_turns doesn't help if the loop is subtle.
Stuck State
Agent keeps trying a failing approach, can't recover. Tool returns error, agent retries with same args, same error.
Hallucinated Tool Calls
Agent invents tool names that don't exist, or calls real tools with nonsensical arguments.
Runaway Cost
Agent spawns subagents, each calling expensive tools in loops. $0.01 task becomes $100 task.
Silent Partial Failure
Agent completes but with incorrect results. It said it succeeded but actually failed midway. No error raised.
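The loop-detection idea used later in this module hashes each call's tool name plus sorted arguments, so identical retries collapse onto a single fingerprint regardless of argument order. A standalone sketch (the tool name `search` is hypothetical):

```python
import hashlib, json
from collections import Counter

def fingerprint(tool: str, args: dict) -> str:
    # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} hash identically
    key = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.md5(key.encode()).hexdigest()[:8]

counts = Counter()
for _ in range(3):
    counts[fingerprint("search", {"q": "q3 sales"})] += 1

# Three identical calls collapse into one fingerprint with count 3
assert len(counts) == 1 and counts.most_common(1)[0][1] == 3
```

Once counts for any fingerprint cross a threshold, you can stop the agent with a concrete "called Nx with same args" reason instead of a vague timeout.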
Loop Detection and Progress Tracking
Critical

```python
import hashlib, json
from collections import Counter
from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentGuardian:
    """Monitors agent execution for failure patterns."""
    max_turns: int = 20
    max_repeated_calls: int = 3   # same tool+args N times = loop
    max_errors: int = 5           # 5 consecutive errors = stuck
    max_cost_usd: float = 1.0     # hard spending limit

    turn_count: int = 0
    error_count: int = 0
    total_cost_usd: float = 0.0
    tool_call_log: list = field(default_factory=list)
    call_counts: Counter = field(default_factory=Counter)

    def _call_fingerprint(self, tool_name: str, args: dict) -> str:
        """Hash of tool name + sorted args — detects repeated identical calls."""
        key = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.md5(key.encode()).hexdigest()[:8]

    def record_tool_call(self, tool_name: str, args: dict,
                         result: Any, tokens_used: int = 0) -> None:
        fp = self._call_fingerprint(tool_name, args)
        self.call_counts[fp] += 1
        self.tool_call_log.append({
            "turn": self.turn_count,
            "tool": tool_name,
            "args": args,
            "fp": fp,
            "success": "error" not in str(result).lower(),
        })
        if isinstance(result, dict) and not result.get("ok", True):
            self.error_count += 1
        else:
            self.error_count = 0  # reset on success
        cost = tokens_used * (3.00 / 1_000_000)  # Sonnet-class output-token price
        self.total_cost_usd += cost

    def check(self) -> tuple[bool, str]:
        """Returns (should_stop, reason). Call before each turn."""
        self.turn_count += 1
        if self.turn_count > self.max_turns:
            return True, f"Max turns exceeded ({self.max_turns})"
        if self.total_cost_usd > self.max_cost_usd:
            return True, f"Cost limit exceeded: ${self.total_cost_usd:.4f} > ${self.max_cost_usd}"
        if self.error_count >= self.max_errors:
            return True, f"Stuck: {self.error_count} consecutive errors"
        for fp, count in self.call_counts.items():
            if count >= self.max_repeated_calls:
                recent = [c for c in self.tool_call_log if c["fp"] == fp][-1]
                return True, f"Loop detected: {recent['tool']} called {count}x with same args"
        return False, ""

# Usage inside agent loop
def guarded_agent(user_message: str) -> dict:
    guardian = AgentGuardian(max_turns=15, max_cost_usd=0.50)
    messages = [{"role": "user", "content": user_message}]
    while True:
        should_stop, reason = guardian.check()
        if should_stop:
            return {"status": "stopped", "reason": reason,
                    "partial_result": extract_partial_result(messages),
                    "turns_used": guardian.turn_count,
                    "cost_usd": guardian.total_cost_usd}
        response = client.messages.create(model="claude-3-5-sonnet-20241022",
                                          max_tokens=4096, tools=TOOLS,
                                          messages=messages)
        if response.stop_reason == "end_turn":
            return {"status": "completed",
                    "answer": response.content[0].text,
                    "turns_used": guardian.turn_count,
                    "cost_usd": guardian.total_cost_usd}
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                guardian.record_tool_call(block.name, block.input, result,
                                          tokens_used=response.usage.output_tokens)
                tool_results.append({"type": "tool_result",
                                     "tool_use_id": block.id,
                                     "content": str(result)})
        messages.append({"role": "user", "content": tool_results})
```

Input and Output Guardrails
Validation Layer

```python
# ── Tool call validation ──────────────────────────────
def validate_tool_call(tool_name: str, args: dict) -> tuple[bool, str]:
    """Validate before executing. Returns (is_valid, error_message)."""
    if tool_name not in TOOL_REGISTRY:
        return False, f"Tool {tool_name!r} does not exist. Available: {list(TOOL_REGISTRY)}"
    tool_schema = next(t for t in TOOLS if t["name"] == tool_name)
    required = tool_schema["input_schema"].get("required", [])
    properties = tool_schema["input_schema"].get("properties", {})
    for req_field in required:
        if req_field not in args:
            return False, f"Missing required field: {req_field!r}"
    for field_name, field_val in args.items():
        if field_name not in properties:
            return False, f"Unknown field: {field_name!r}"
        expected_type = properties[field_name].get("type")
        if expected_type == "string" and not isinstance(field_val, str):
            return False, f"{field_name} must be a string, got {type(field_val).__name__}"
        if expected_type == "integer" and not isinstance(field_val, int):
            return False, f"{field_name} must be an integer"
    return True, ""

def execute_tool_safe(tool_name: str, args: dict) -> dict:
    is_valid, error = validate_tool_call(tool_name, args)
    if not is_valid:
        return {"ok": False, "error": "INVALID_TOOL_CALL", "message": error,
                "suggestion": "Check the tool name and argument types before calling again."}
    try:
        result = TOOL_REGISTRY[tool_name](**args)
        return result if isinstance(result, dict) else {"ok": True, "result": result}
    except Exception as e:
        return {"ok": False, "error": "TOOL_EXECUTION_ERROR", "message": str(e)}

# ── Output guardrail ──────────────────────────────────
# Validate the agent's final answer before returning to user
from pydantic import BaseModel

class AgentOutputGuardrail(BaseModel):
    is_complete: bool
    has_answer: bool
    is_on_topic: bool
    issues: list[str] = []

def validate_agent_output(original_goal: str, output: str) -> AgentOutputGuardrail:
    # instructor_client: an instructor-patched Anthropic client
    return instructor_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{"role": "user", "content": f"""Validate this agent output against the original goal.
Goal: {original_goal}
Output: {output}
Check: Is the goal addressed? Is there a clear answer? Is it on topic?"""}],
        response_model=AgentOutputGuardrail,
    )
```
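To make the schema-validation flow concrete, here is a self-contained toy version with a hypothetical one-tool registry (`get_weather` and its schema are invented for illustration; the trimmed `validate` follows the same shape as the full validator above):

```python
# Hypothetical one-tool registry and schema, just to exercise the flow
TOOLS = [{
    "name": "get_weather",
    "input_schema": {
        "required": ["city"],
        "properties": {"city": {"type": "string"}},
    },
}]
TOOL_REGISTRY = {"get_weather": lambda city: {"ok": True, "temp_c": 21}}

def validate(tool_name: str, args: dict) -> tuple[bool, str]:
    # Trimmed-down schema check: unknown tool, missing args, wrong types
    if tool_name not in TOOL_REGISTRY:
        return False, "unknown tool"
    schema = next(t for t in TOOLS if t["name"] == tool_name)["input_schema"]
    missing = [r for r in schema.get("required", []) if r not in args]
    if missing:
        return False, f"missing: {missing}"
    for name, val in args.items():
        expected = schema["properties"].get(name, {}).get("type")
        if expected == "string" and not isinstance(val, str):
            return False, f"{name} must be a string"
    return True, ""

assert validate("get_wether", {})[0] is False          # hallucinated name caught
assert validate("get_weather", {})[0] is False         # missing required arg caught
assert validate("get_weather", {"city": "Oslo"}) == (True, "")
```

The key design choice: validation failures return structured error dicts to the agent rather than raising, so the model can read the error and self-correct on the next turn.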
Cost Circuit Breakers
Financial Safety

```python
import sqlite3
from datetime import datetime

MODEL_COSTS = {
    "claude-3-5-sonnet-20241022": {"input": 3.0/1e6, "output": 15.0/1e6},
    "claude-3-haiku-20240307": {"input": 0.25/1e6, "output": 1.25/1e6},
}

class AgentCostCircuitBreaker:
    """Hard spending limits for agent sessions."""
    def __init__(self, session_limit_usd: float = 1.0,
                 daily_limit_usd: float = 10.0,
                 per_tool_call_limit_usd: float = 0.10):
        self.session_limit = session_limit_usd
        self.daily_limit = daily_limit_usd
        self.per_tool_call_limit = per_tool_call_limit_usd
        self.session_spend = 0.0
        self.session_id = datetime.utcnow().isoformat()

    def _compute_cost(self, model: str, input_tok: int, output_tok: int) -> float:
        prices = MODEL_COSTS.get(model, MODEL_COSTS["claude-3-5-sonnet-20241022"])
        return input_tok * prices["input"] + output_tok * prices["output"]

    def _get_daily_spend(self) -> float:
        today = datetime.utcnow().strftime("%Y-%m-%d")
        with sqlite3.connect("agent_costs.db") as conn:
            conn.execute("CREATE TABLE IF NOT EXISTS costs (ts TEXT, session TEXT, cost REAL)")
            row = conn.execute(
                "SELECT SUM(cost) FROM costs WHERE ts LIKE ?", (f"{today}%",)).fetchone()
            return row[0] or 0.0

    def record_and_check(self, model: str, input_tok: int,
                         output_tok: int) -> tuple[float, bool, str]:
        cost = self._compute_cost(model, input_tok, output_tok)
        self.session_spend += cost
        with sqlite3.connect("agent_costs.db") as conn:
            # Ensure the table exists before the first insert on a fresh DB
            conn.execute("CREATE TABLE IF NOT EXISTS costs (ts TEXT, session TEXT, cost REAL)")
            conn.execute("INSERT INTO costs VALUES (?,?,?)",
                         (datetime.utcnow().isoformat(), self.session_id, cost))
        daily = self._get_daily_spend()
        if cost > self.per_tool_call_limit:
            return cost, True, f"Single call cost ${cost:.4f} exceeds per-call limit"
        if self.session_spend > self.session_limit:
            return cost, True, f"Session spend ${self.session_spend:.4f} exceeds session limit"
        if daily > self.daily_limit:
            return cost, True, f"Daily spend ${daily:.4f} exceeds daily limit"
        return cost, False, ""
```

⚠️ Always set a session cost limit for any agent that can spawn subagents or loop. A misconfigured agent that recursively calls expensive tools can exhaust a $100 budget in minutes. The circuit breaker pattern is not optional — it is the difference between a manageable incident and a billing nightmare.
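The essential logic is how the three limits compose: the most specific (per-call) is checked first, then cumulative session spend, then the daily total. A minimal in-memory sketch, without the SQLite persistence, using illustrative (assumed) token prices:

```python
# Assumed Sonnet-class prices, USD per token; illustrative only
PRICE_IN, PRICE_OUT = 3.0 / 1e6, 15.0 / 1e6

class SessionBreaker:
    def __init__(self, session_limit: float = 1.0, per_call_limit: float = 0.10):
        self.session_limit = session_limit
        self.per_call_limit = per_call_limit
        self.spend = 0.0

    def record_and_check(self, input_tok: int, output_tok: int) -> tuple[bool, str]:
        cost = input_tok * PRICE_IN + output_tok * PRICE_OUT
        self.spend += cost
        if cost > self.per_call_limit:       # most specific limit first
            return True, "per-call limit"
        if self.spend > self.session_limit:  # cumulative session spend
            return True, "session limit"
        return False, ""

breaker = SessionBreaker(session_limit=0.01, per_call_limit=0.005)
tripped, reason = breaker.record_and_check(1000, 500)  # ~$0.0105 for this call
assert (tripped, reason) == (True, "per-call limit")
```

Note that spend is recorded before the check, so a tripped breaker still leaves an accurate cost record for the call that tripped it.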
Structured Agent Logging
Audit & Debug

pip install structlog

```python
import structlog, time

# Configure structlog for JSON output
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)
logger = structlog.get_logger()

class AgentLogger:
    """Structured logging for agent execution."""
    def __init__(self, session_id: str, goal: str):
        self.session_id = session_id
        self.goal = goal
        self.turn = 0
        self.start_time = time.time()
        logger.info("agent_started", session_id=session_id, goal=goal)

    def log_turn(self, stop_reason: str, tools_called: list):
        self.turn += 1
        logger.info("agent_turn", session_id=self.session_id, turn=self.turn,
                    stop_reason=stop_reason, tools_called=tools_called)

    def log_tool_call(self, tool_name: str, args: dict, result: dict,
                      latency_ms: float, cost_usd: float):
        success = result.get("ok", True)
        logger.info("tool_call", session_id=self.session_id, turn=self.turn,
                    tool=tool_name, success=success,
                    latency_ms=round(latency_ms, 1), cost_usd=round(cost_usd, 6),
                    error=result.get("error") if not success else None)

    def log_completion(self, status: str, total_cost_usd: float, answer: str = ""):
        elapsed = round(time.time() - self.start_time, 2)
        logger.info("agent_completed", session_id=self.session_id, status=status,
                    total_turns=self.turn, elapsed_sec=elapsed,
                    total_cost_usd=round(total_cost_usd, 6), answer_length=len(answer))

    def log_failure(self, reason: str, last_tool: str = ""):
        logger.error("agent_failed", session_id=self.session_id, turn=self.turn,
                     reason=reason, last_tool=last_tool)

# Example output (one JSON line per event):
# {"event":"agent_started","session_id":"abc123","goal":"Analyse Q3 sales","level":"info","timestamp":"2024-..."}
# {"event":"tool_call","tool":"search_sales_db","success":true,"latency_ms":124.3,"cost_usd":0.000045,...}
# {"event":"agent_failed","reason":"Loop detected: search_sales_db called 3x with same args",...}
```
💡 Structured logs are queryable. When you have 10,000 agent runs in production and one fails, you need to find: which session, which turn, which tool, what the exact args were. JSON logs let you grep, jq-filter, and aggregate across millions of events. Unstructured print() statements do not.
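Because each log line is a standalone JSON object, post-hoc analysis is a few lines of stdlib Python. A sketch over three hand-written example events in the same shape as the structlog output above:

```python
import json
from collections import Counter

# Hand-written example events, one JSON object per line (JSONL)
log_lines = [
    '{"event": "tool_call", "tool": "search_sales_db", "success": true}',
    '{"event": "tool_call", "tool": "search_sales_db", "success": false}',
    '{"event": "tool_call", "tool": "fetch_url", "success": true}',
]
calls = [json.loads(line) for line in log_lines]

# Success rate and call distribution across all tool_call events
success_rate = sum(e["success"] for e in calls) / len(calls)
by_tool = Counter(e["tool"] for e in calls)
print(f"success rate: {success_rate:.0%}, top tool: {by_tool.most_common(1)[0]}")
```

In production you would read the lines from a log file instead of a list; the aggregation is identical, and the same approach computes failure distribution or per-session cost.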
Recovery and Graceful Degradation
Resilience

```python
# ── Pattern 1: Alternative strategy prompt ────────────
# When a tool fails N times, inject a prompt asking the agent to try differently
STUCK_RECOVERY_MSG = """You have encountered repeated errors with {tool_name}.
The error was: {error_message}

Please try a different approach:
- Use a different tool if available
- Simplify your query or arguments
- If you cannot complete this subtask, explain what you found so far and skip it

Do NOT call {tool_name} again with the same arguments."""

def inject_recovery_hint(messages: list, tool_name: str, error: str) -> list:
    recovery = STUCK_RECOVERY_MSG.format(tool_name=tool_name, error_message=error)
    messages.append({
        "role": "user",
        "content": [{"type": "text", "text": recovery}]
    })
    return messages

# ── Pattern 2: Partial result extraction ─────────────
# When agent hits a limit, extract what it learned before stopping
def extract_partial_result(messages: list) -> str:
    if len(messages) < 2:
        return "No results gathered before timeout."
    response = client.messages.create(
        model="claude-3-haiku-20240307", max_tokens=512,
        messages=[
            *messages,
            {"role": "user", "content": "Summarise what you have found so far, "
                                        "even if incomplete. Be honest about what's missing."}
        ]
    )
    return response.content[0].text

# ── Pattern 3: Fallback to human ─────────────────────
# When agent cannot proceed, escalate with full context
def escalate_to_human(session_id: str, goal: str, messages: list,
                      failure_reason: str) -> dict:
    partial = extract_partial_result(messages)
    ticket = {
        "session_id": session_id,
        "original_goal": goal,
        "failure_reason": failure_reason,
        "partial_result": partial,
        "turns_completed": len([m for m in messages if m["role"] == "assistant"]),
        "escalated_at": datetime.utcnow().isoformat(),
        "priority": "high" if "cost" in failure_reason.lower() else "normal",
    }
    create_human_task(ticket)  # your ticketing system
    return {"status": "escalated", "ticket_id": ticket["session_id"],
            "message": "A human agent will continue this task."}

# ── Pattern 4: Checkpoint and resume ─────────────────
# Save progress periodically — resume if agent crashes
import pickle, pathlib

def save_checkpoint(session_id: str, messages: list, state: dict):
    path = pathlib.Path(f".checkpoints/{session_id}.pkl")
    path.parent.mkdir(exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump({"messages": messages, "state": state}, f)

def load_checkpoint(session_id: str) -> dict | None:
    path = pathlib.Path(f".checkpoints/{session_id}.pkl")
    if not path.exists():
        return None
    with open(path, "rb") as f:
        return pickle.load(f)
```
FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Article | Anthropic: Building Effective Agents — anthropic.com/research | Covers agent failure modes and the importance of minimal footprint and human oversight. |
| Library | structlog — structlog.org — structured logging for Python | The standard library for structured JSON logging in Python. Read the Getting Started guide. |
| Docs | LangGraph: Checkpointing — langchain-ai.github.io/langgraph | LangGraph's built-in checkpoint system for agent state persistence and recovery. |
Milestone Project

Take your M19 research agent and add the full production-hardening layer from this module.
Requirements
- AgentGuardian — loop detection via tool call fingerprinting, max_turns, consecutive error counter
- AgentCostCircuitBreaker — session limit ($1), daily limit ($10), per-call limit ($0.10)
- Tool validation — validate all tool names and arg types before execution
- Structured logging — every turn, tool call, failure, and completion logged as JSON
- Recovery hints — inject alternative strategy prompt after 3 consecutive tool errors
- Partial result extraction — on any stop (limit/loop/cost), extract and return what was learned
- Checkpoint/resume — save state after each turn, auto-resume if session_id provided
Testing
- Trigger every failure mode deliberately and verify each guard works
- Run 10 real tasks and review the structured logs — identify any unexpected failure patterns
Skills: AgentGuardian, cost circuit breaker, tool validation, structlog, recovery patterns, checkpoint/resume
Lab 1: Trigger and Detect Every Failure Mode
Objective: Deliberately trigger all 5 failure modes and verify the AgentGuardian catches each one.
Lab 2: Structured Log Analysis
Objective: Practice querying structured logs to diagnose agent failures post-hoc.
Lab 3: Checkpoint and Resume
Objective: Verify that checkpointing allows agent recovery from crashes without losing work.
P6-M21 MASTERY CHECKLIST
- Can name all 5 agent failure modes: infinite loop, stuck state, hallucinated tool calls, runaway cost, silent partial failure
- Can implement tool call fingerprinting using content hash to detect repeated identical calls
- Can implement AgentGuardian that checks max_turns, max_repeated_calls, consecutive errors, and cost before every turn
- All agents return structured results on failure — never Python exceptions propagating to the user
- Can validate tool name and argument types before execution using the tool's JSON schema
- Can validate agent output against the original goal using a cheap LLM checker
- Can implement cost circuit breaker with session, daily, and per-call limits using SQLite
- Can set up structlog for JSON-structured logging with turn, tool, cost, and latency fields
- Can implement the recovery hint pattern: inject alternative strategy prompt after repeated errors
- Can extract a partial result from conversation history when an agent hits a limit
- Can escalate to human with a structured ticket containing partial results and failure context
- Can implement checkpoint/resume with pickle or LangGraph's built-in checkpointer
- Can query JSONL structured logs to compute success rate, failure distribution, and most-called tools
- Completed Lab 1: all 5 failure modes triggered and verified
- Completed Lab 2: structured log analysis with success rate and failure debugging
- Completed Lab 3: checkpoint/resume verified end-to-end
- Milestone project: hardened agent with all guards pushed to GitHub
✅ When complete: Move to P6-M22 — Evaluation Harnesses. You now have agents that fail safely. M22 covers how to measure and improve agent quality systematically with DeepEval, Ragas, and LLM-as-judge.