Part 6 — Agents, Workflows & Evaluation  ·  Module 21 of 22
Failure Handling in Agents
Loops, stuck states, hallucinated tool calls, runaway costs — and how to handle all of them
⏱ 1 Week 🟠 Intermediate–Advanced 🔧 LangGraph · Tenacity · Structlog 📋 Prerequisite: P6-M20
🎯

What This Module Covers

Production Hardening

Agents fail in ways chains never do. They can loop forever, call tools that don't exist, spend your entire monthly API budget in 10 minutes, or get stuck unable to make progress. This module gives you the tools to detect, contain, and recover from every major agent failure mode.

  • Failure taxonomy — the 5 agent failure modes and how to recognise each
  • Loop detection — detecting infinite loops, repeated tool calls, lack of progress
  • Guardrails — output validation, tool call validation, scope enforcement
  • Cost circuit breakers — hard spending limits that stop runaway agents
  • Structured agent logging — capturing every decision for debugging and audit
  • Recovery patterns — graceful degradation, fallback to human, partial result return
🚨

The 5 Agent Failure Modes

Know These

Infinite Loop

Agent calls the same tool repeatedly with the same args, never making progress. max_turns doesn't help if the loop is subtle.

Fix: tool call history deduplication, progress detection

Stuck State

Agent keeps trying a failing approach and can't recover: a tool returns an error, the agent retries with the same args, and gets the same error.

Fix: error escalation counter, alternative strategy prompt

Hallucinated Tool Calls

Agent invents tool names that don't exist, or calls real tools with nonsensical arguments.

Fix: strict tool registry validation, argument schema enforcement

Runaway Cost

Agent spawns subagents, each calling expensive tools in loops. A $0.01 task becomes a $100 task.

Fix: cost circuit breaker, per-session spending cap

Silent Partial Failure

Agent completes but with incorrect results. It reports success but actually failed midway; no error is raised.

Fix: result validation, structured completion checks, audit log
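The "silent partial failure" fix can be as simple as a deliverables checklist validated after the agent reports success. A minimal sketch (the deliverable names here are hypothetical):

```python
def check_completion(expected_keys: list[str], result: dict) -> list[str]:
    """Return deliverables the agent claimed to finish but left empty or missing."""
    return [k for k in expected_keys if not result.get(k)]

# The agent reported success, but one deliverable is empty:
report = {"summary": "Q3 sales up 4%", "chart_data": None}
print(check_completion(["summary", "chart_data"], report))   # ['chart_data']
```

A non-empty list from the checker means the run should be flagged for review rather than returned to the user as a success.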
🔄

Loop Detection and Progress Tracking

Critical
import hashlib, json
from collections import Counter
from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentGuardian:
    """Monitors agent execution for failure patterns."""
    max_turns:         int   = 20
    max_repeated_calls: int  = 3     # same tool+args N times = loop
    max_errors:        int   = 5     # 5 consecutive errors = stuck
    max_cost_usd:      float = 1.0   # hard spending limit

    turn_count:        int   = 0
    error_count:       int   = 0
    total_cost_usd:    float = 0.0
    tool_call_log:     list  = field(default_factory=list)
    call_counts:       dict  = field(default_factory=Counter)

    def _call_fingerprint(self, tool_name: str, args: dict) -> str:
        """Hash of tool name + sorted args — detects repeated identical calls."""
        key = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.md5(key.encode()).hexdigest()[:8]

    def record_tool_call(self, tool_name: str, args: dict,
                         result: Any, tokens_used: int = 0) -> None:
        fp = self._call_fingerprint(tool_name, args)
        self.call_counts[fp] += 1
        # Treat {"ok": False, ...} result dicts as failures, anything else as success
        ok = not (isinstance(result, dict) and not result.get("ok", True))
        self.tool_call_log.append({
            "turn":    self.turn_count,
            "tool":    tool_name,
            "args":    args,
            "fp":      fp,
            "success": ok,
        })
        if ok:
            self.error_count = 0   # reset on success
        else:
            self.error_count += 1
        # Rough estimate: prices all tokens at the Sonnet input rate ($3.00/M)
        cost = tokens_used * (3.00 / 1_000_000)
        self.total_cost_usd += cost

    def check(self) -> tuple[bool, str]:
        """Returns (should_stop, reason). Call before each turn."""
        self.turn_count += 1

        if self.turn_count > self.max_turns:
            return True, f"Max turns exceeded ({self.max_turns})"

        if self.total_cost_usd > self.max_cost_usd:
            return True, f"Cost limit exceeded: ${self.total_cost_usd:.4f} > ${self.max_cost_usd}"

        if self.error_count >= self.max_errors:
            return True, f"Stuck: {self.error_count} consecutive errors"

        for fp, count in self.call_counts.items():
            if count >= self.max_repeated_calls:
                recent = [c for c in self.tool_call_log if c["fp"] == fp][-1]
                return True, f"Loop detected: {recent['tool']} called {count}x with same args"

        return False, ""

# Usage inside the agent loop (assumes an Anthropic `client`, a TOOLS schema
# list, and execute_tool / extract_partial_result helpers defined elsewhere)
def guarded_agent(user_message: str) -> dict:
    guardian = AgentGuardian(max_turns=15, max_cost_usd=0.50)
    messages = [{"role": "user", "content": user_message}]

    while True:
        should_stop, reason = guardian.check()
        if should_stop:
            return {"status": "stopped", "reason": reason,
                    "partial_result": extract_partial_result(messages),
                    "turns_used": guardian.turn_count,
                    "cost_usd": guardian.total_cost_usd}

        response = client.messages.create(model="claude-3-5-sonnet-20241022",
                                           max_tokens=4096, tools=TOOLS, messages=messages)

        if response.stop_reason == "end_turn":
            return {"status": "completed",
                    "answer": response.content[0].text,
                    "turns_used": guardian.turn_count,
                    "cost_usd": guardian.total_cost_usd}

        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                guardian.record_tool_call(block.name, block.input, result,
                                          tokens_used=response.usage.output_tokens)
                tool_results.append({"type": "tool_result",
                                      "tool_use_id": block.id, "content": str(result)})
        if not tool_results:
            # e.g. stop_reason == "max_tokens" with no tool_use blocks:
            # bail out rather than send an empty tool_result turn
            return {"status": "stopped",
                    "reason": f"Unexpected stop: {response.stop_reason}",
                    "turns_used": guardian.turn_count,
                    "cost_usd": guardian.total_cost_usd}
        messages.append({"role": "user", "content": tool_results})
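The guardian above catches identical repeated calls, but a subtler loop varies its arguments while learning nothing new. One possible extension (not part of AgentGuardian) is to fingerprint tool *results* instead of arguments: if the last N results are byte-identical, the agent is likely stalled.

```python
import hashlib

class ProgressTracker:
    """Detects lack of progress even when tool args vary slightly:
    identical results over a sliding window mean no new information."""
    def __init__(self, window: int = 3):
        self.window = window
        self.result_hashes: list[str] = []

    def record(self, result) -> None:
        # Hash the stringified result; cheap and good enough for dedup
        self.result_hashes.append(
            hashlib.md5(str(result).encode()).hexdigest())

    def is_stalled(self) -> bool:
        if len(self.result_hashes) < self.window:
            return False
        return len(set(self.result_hashes[-self.window:])) == 1
```

Calling `record()` alongside `guardian.record_tool_call()` and checking `is_stalled()` each turn gives you a second, complementary loop detector.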
🛡

Input and Output Guardrails

Validation Layer
# ── Tool call validation ──────────────────────────────
def validate_tool_call(tool_name: str, args: dict) -> tuple[bool, str]:
    """Validate before executing. Returns (is_valid, error_message)."""
    if tool_name not in TOOL_REGISTRY:
        return False, f"Tool {tool_name!r} does not exist. Available: {list(TOOL_REGISTRY)}"

    tool_schema = next((t for t in TOOLS if t["name"] == tool_name), None)
    if tool_schema is None:
        return False, f"No schema found for tool {tool_name!r}"
    required = tool_schema["input_schema"].get("required", [])
    properties = tool_schema["input_schema"].get("properties", {})

    for req_field in required:
        if req_field not in args:
            return False, f"Missing required field: {req_field!r}"

    for field_name, field_val in args.items():
        if field_name not in properties:
            return False, f"Unknown field: {field_name!r}"
        expected_type = properties[field_name].get("type")
        if expected_type == "string" and not isinstance(field_val, str):
            return False, f"{field_name} must be a string, got {type(field_val).__name__}"
        if expected_type == "integer" and not isinstance(field_val, int):
            return False, f"{field_name} must be an integer"

    return True, ""

def execute_tool_safe(tool_name: str, args: dict) -> dict:
    is_valid, error = validate_tool_call(tool_name, args)
    if not is_valid:
        return {"ok": False, "error": "INVALID_TOOL_CALL", "message": error,
                "suggestion": "Check the tool name and argument types before calling again."}
    try:
        result = TOOL_REGISTRY[tool_name](**args)
        return result if isinstance(result, dict) else {"ok": True, "result": result}
    except Exception as e:
        return {"ok": False, "error": "TOOL_EXECUTION_ERROR", "message": str(e)}
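Transient tool failures (timeouts, rate limits) are often worth retrying before surfacing a structured error to the agent. Tenacity, listed in this module's toolchain, is the production choice; here is a stdlib sketch of the same bounded-backoff pattern (`TransientToolError` is a hypothetical marker exception your tools would raise for retryable failures):

```python
import time

class TransientToolError(Exception):
    """Hypothetical marker for retryable tool failures (timeouts, 429s)."""

def call_with_retry(tool_fn, args: dict, attempts: int = 3,
                    base_delay: float = 0.5):
    """Retry on transient errors with exponential backoff (0.5s, 1s, 2s...).
    Mirrors tenacity's stop_after_attempt + wait_exponential behaviour."""
    for attempt in range(1, attempts + 1):
        try:
            return tool_fn(**args)
        except TransientToolError:
            if attempt == attempts:
                raise   # out of retries; let execute_tool_safe wrap the error
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Non-transient errors should not be retried; let them fall through to the structured `TOOL_EXECUTION_ERROR` path above so the agent can change strategy.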

# ── Output guardrail ──────────────────────────────────
# Validate the agent's final answer before returning to user
from pydantic import BaseModel

class AgentOutputGuardrail(BaseModel):
    is_complete: bool
    has_answer:  bool
    is_on_topic: bool
    issues:      list[str] = []

# `instructor_client` is assumed to be an instructor-wrapped Anthropic client
# that returns the response_model directly instead of raw messages
def validate_agent_output(original_goal: str, output: str) -> AgentOutputGuardrail:
    return instructor_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{"role": "user", "content":
            f"""Validate this agent output against the original goal.

Goal: {original_goal}
Output: {output}

Check: Is the goal addressed? Is there a clear answer? Is it on topic?"""}],
        response_model=AgentOutputGuardrail
    )
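The verdict only matters if the caller acts on it. A minimal sketch of gating the final answer on the guardrail result (the reject-handling policy, retry versus escalate, is left open):

```python
def finalize_output(output: str, verdict) -> dict:
    """Gate the agent's final answer on the guardrail verdict:
    pass it through only if every check holds."""
    if verdict.is_complete and verdict.has_answer and verdict.is_on_topic:
        return {"status": "ok", "answer": output}
    # Rejected: surface the issues so the caller can retry or escalate
    return {"status": "rejected", "issues": verdict.issues}
```

In practice you would feed the `issues` list back to the agent for one bounded retry before falling back to human escalation.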
💰

Cost Circuit Breakers

Financial Safety
import sqlite3
from datetime import datetime

MODEL_COSTS = {
    "claude-3-5-sonnet-20241022": {"input": 3.0/1e6, "output": 15.0/1e6},
    "claude-3-haiku-20240307":    {"input": 0.25/1e6, "output": 1.25/1e6},
}

class AgentCostCircuitBreaker:
    """Hard spending limits for agent sessions."""
    def __init__(self, session_limit_usd: float = 1.0,
                 daily_limit_usd: float = 10.0,
                 per_tool_call_limit_usd: float = 0.10):
        self.session_limit       = session_limit_usd
        self.daily_limit         = daily_limit_usd
        self.per_tool_call_limit = per_tool_call_limit_usd
        self.session_spend       = 0.0
        self.session_id          = datetime.utcnow().isoformat()
        with sqlite3.connect("agent_costs.db") as conn:
            # Create the table up front so the first INSERT in
            # record_and_check cannot fail on a fresh database
            conn.execute("CREATE TABLE IF NOT EXISTS costs (ts TEXT, session TEXT, cost REAL)")

    def _compute_cost(self, model: str, input_tok: int, output_tok: int) -> float:
        prices = MODEL_COSTS.get(model, MODEL_COSTS["claude-3-5-sonnet-20241022"])
        return input_tok * prices["input"] + output_tok * prices["output"]

    def _get_daily_spend(self) -> float:
        today = datetime.utcnow().strftime("%Y-%m-%d")
        with sqlite3.connect("agent_costs.db") as conn:
            conn.execute("CREATE TABLE IF NOT EXISTS costs (ts TEXT, session TEXT, cost REAL)")
            row = conn.execute(
                "SELECT SUM(cost) FROM costs WHERE ts LIKE ?", (f"{today}%",)).fetchone()
        return row[0] or 0.0

    def record_and_check(self, model: str, input_tok: int,
                         output_tok: int) -> tuple[float, bool, str]:
        cost  = self._compute_cost(model, input_tok, output_tok)
        self.session_spend += cost

        with sqlite3.connect("agent_costs.db") as conn:
            conn.execute("INSERT INTO costs VALUES (?,?,?)",
                         (datetime.utcnow().isoformat(), self.session_id, cost))

        daily = self._get_daily_spend()

        if cost > self.per_tool_call_limit:
            return cost, True, f"Single call cost ${cost:.4f} exceeds per-call limit"
        if self.session_spend > self.session_limit:
            return cost, True, f"Session spend ${self.session_spend:.4f} exceeds session limit"
        if daily > self.daily_limit:
            return cost, True, f"Daily spend ${daily:.4f} exceeds daily limit"

        return cost, False, ""

⚠️ Always set a session cost limit for any agent that can spawn subagents or loop. A misconfigured agent that recursively calls expensive tools can exhaust a $100 budget in minutes. The circuit breaker pattern is not optional — it is the difference between a manageable incident and a billing nightmare.
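A back-of-envelope sketch of why this warning holds: with subagent fan-out, call counts grow geometrically with recursion depth. The numbers below are illustrative, not measured.

```python
def projected_spend(cost_per_call: float, fanout: int, depth: int) -> float:
    """Total cost of a recursive agent tree: fanout**d calls at each level d."""
    return sum(cost_per_call * fanout ** d for d in range(depth + 1))

# Four levels of 3-way fan-out at $0.02/call is already 121 calls:
print(round(projected_spend(0.02, 3, 4), 2))   # 2.42
```

One more level of depth triples the marginal cost, which is exactly why per-session caps must be enforced in code, not in prompts.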

📋

Structured Agent Logging

Audit & Debug
pip install structlog

import structlog, time

# Configure structlog for JSON output
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ]
)
logger = structlog.get_logger()

class AgentLogger:
    """Structured logging for agent execution."""
    def __init__(self, session_id: str, goal: str):
        self.session_id = session_id
        self.goal       = goal
        self.turn       = 0
        self.start_time = time.time()
        logger.info("agent_started", session_id=session_id, goal=goal)

    def log_turn(self, stop_reason: str, tools_called: list):
        self.turn += 1
        logger.info("agent_turn", session_id=self.session_id,
                    turn=self.turn, stop_reason=stop_reason,
                    tools_called=tools_called)

    def log_tool_call(self, tool_name: str, args: dict, result: dict,
                      latency_ms: float, cost_usd: float):
        success = result.get("ok", True)
        logger.info("tool_call", session_id=self.session_id, turn=self.turn,
                    tool=tool_name, success=success,
                    latency_ms=round(latency_ms, 1), cost_usd=round(cost_usd, 6),
                    error=result.get("error") if not success else None)

    def log_completion(self, status: str, total_cost_usd: float, answer: str = ""):
        elapsed = round(time.time() - self.start_time, 2)
        logger.info("agent_completed", session_id=self.session_id,
                    status=status, total_turns=self.turn,
                    elapsed_sec=elapsed, total_cost_usd=round(total_cost_usd, 6),
                    answer_length=len(answer))

    def log_failure(self, reason: str, last_tool: str = ""):
        logger.error("agent_failed", session_id=self.session_id,
                     turn=self.turn, reason=reason, last_tool=last_tool)

# Example output (one JSON line per event):
# {"event":"agent_started","session_id":"abc123","goal":"Analyse Q3 sales","level":"info","timestamp":"2024-..."}
# {"event":"tool_call","tool":"search_sales_db","success":true,"latency_ms":124.3,"cost_usd":0.000045,...}
# {"event":"agent_failed","reason":"Loop detected: search_sales_db called 3x with same args",...}

💡 Structured logs are queryable. When you have 10,000 agent runs in production and one fails, you need to find: which session, which turn, which tool, what the exact args were. JSON logs let you grep, jq-filter, and aggregate across millions of events. Unstructured print() statements do not.
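As a taste of that workflow, here is a sketch of aggregating failure reasons across runs. It assumes one JSON event per line, with the `event` and `reason` field names produced by the AgentLogger above.

```python
import json
from collections import Counter

def failure_reasons(jsonl_lines: list[str]) -> Counter:
    """Count agent_failed events by reason across all sessions."""
    reasons = Counter()
    for line in jsonl_lines:
        event = json.loads(line)
        if event.get("event") == "agent_failed":
            reasons[event.get("reason", "unknown")] += 1
    return reasons

# In production you would stream these lines from agent.jsonl:
# with open("agent.jsonl") as f: print(failure_reasons(list(f)).most_common(5))
```

The same few lines, pointed at a different event type, give you success rates, tool popularity, or per-session cost totals.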

🔁

Recovery and Graceful Degradation

Resilience
# ── Pattern 1: Alternative strategy prompt ────────────
# When tool fails N times, inject a prompt asking the agent to try differently

STUCK_RECOVERY_MSG = """You have encountered repeated errors with {tool_name}.
The error was: {error_message}

Please try a different approach:
- Use a different tool if available
- Simplify your query or arguments
- If you cannot complete this subtask, explain what you found so far and skip it

Do NOT call {tool_name} again with the same arguments."""

def inject_recovery_hint(messages: list, tool_name: str, error: str) -> list:
    recovery = STUCK_RECOVERY_MSG.format(tool_name=tool_name, error_message=error)
    messages.append({
        "role": "user",
        "content": [{"type": "text", "text": recovery}]
    })
    return messages

# ── Pattern 2: Partial result extraction ─────────────
# When agent hits limit, extract what it learned before stopping

def extract_partial_result(messages: list) -> str:
    if len(messages) < 2:
        return "No results gathered before timeout."

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=512,
        messages=[
            *messages,
            {"role": "user", "content":
             "Summarise what you have found so far, even if incomplete. Be honest about what's missing."}
        ]
    )
    return response.content[0].text

# ── Pattern 3: Fallback to human ─────────────────────
# When agent cannot proceed, escalate with full context

def escalate_to_human(session_id: str, goal: str, messages: list,
                      failure_reason: str) -> dict:
    partial = extract_partial_result(messages)
    ticket  = {
        "session_id":      session_id,
        "original_goal":   goal,
        "failure_reason":  failure_reason,
        "partial_result":  partial,
        "turns_completed": len([m for m in messages if m["role"] == "assistant"]),
        "escalated_at":    datetime.utcnow().isoformat(),
        "priority":        "high" if "cost" in failure_reason.lower() else "normal"
    }
    create_human_task(ticket)   # your ticketing system
    return {"status": "escalated", "ticket_id": ticket["session_id"],
            "message": "A human agent will continue this task."}

# ── Pattern 4: Checkpoint and resume ─────────────────
# Save progress periodically — resume if agent crashes

import pickle, pathlib

def save_checkpoint(session_id: str, messages: list, state: dict):
    path = pathlib.Path(f".checkpoints/{session_id}.pkl")
    path.parent.mkdir(exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump({"messages": messages, "state": state}, f)

def load_checkpoint(session_id: str) -> dict | None:
    path = pathlib.Path(f".checkpoints/{session_id}.pkl")
    if not path.exists():
        return None
    with open(path, "rb") as f:
        return pickle.load(f)
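A sketch of the resume path these helpers enable, inlining the load logic for clarity. The initial-state shape (`{"turn": 0}`) is an assumption; use whatever state dict your agent loop carries.

```python
import pickle, pathlib

def start_or_resume(session_id: str, user_message: str):
    """Restore a crashed session from its checkpoint, or start fresh."""
    path = pathlib.Path(f".checkpoints/{session_id}.pkl")
    if path.exists():
        with open(path, "rb") as f:
            saved = pickle.load(f)
        return saved["messages"], saved["state"]   # resume mid-task
    # No checkpoint: begin a new session from the user message
    return [{"role": "user", "content": user_message}], {"turn": 0}
```

Note that pickle checkpoints should only ever be loaded from paths your own process wrote; never unpickle untrusted files.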

FREE LEARNING RESOURCES

  • Article — Anthropic: Building Effective Agents (anthropic.com/research) — covers agent failure modes and the importance of minimal footprint and human oversight.
  • Library — structlog (structlog.org) — the standard library for structured JSON logging in Python; read the Getting Started guide.
  • Docs — LangGraph: Checkpointing (langchain-ai.github.io/langgraph) — LangGraph's built-in checkpoint system for agent state persistence and recovery.
🛠 Hardened Agent with Full Failure Handling [Advanced] 3–4 days

Take your M19 research agent and add the full production hardening layer from this module.

Requirements

  • AgentGuardian — loop detection via tool call fingerprinting, max_turns, consecutive error counter
  • AgentCostCircuitBreaker — session limit ($1), daily limit ($10), per-call limit ($0.10)
  • Tool validation — validate all tool names and arg types before execution
  • Structured logging — every turn, tool call, failure, and completion logged as JSON
  • Recovery hints — inject alternative strategy prompt after 3 consecutive tool errors
  • Partial result extraction — on any stop (limit/loop/cost), extract and return what was learned
  • Checkpoint/resume — save state after each turn, auto-resume if session_id provided

Testing

  • Trigger every failure mode deliberately and verify each guard works
  • Run 10 real tasks and review the structured logs — identify any unexpected failure patterns

Skills: AgentGuardian, cost circuit breaker, tool validation, structlog, recovery patterns, checkpoint/resume

LAB 1

Trigger and Detect Every Failure Mode

Objective: Deliberately trigger all 5 failure modes and verify the AgentGuardian catches each one.

1
Build an agent with AgentGuardian (max_turns=10, max_repeated_calls=3, max_errors=4, max_cost=$0.50).
2
Trigger Infinite Loop: make a tool that always returns "retry" and never changes state. Verify guardian catches it at 3 repeated calls.
3
Trigger Stuck State: make a tool always return an error dict. Verify guardian catches it at 4 consecutive errors.
4
Trigger Hallucinated Tool: remove a tool from the registry but leave it in the description. Verify execute_tool_safe catches and returns structured error.
5
Trigger Runaway Cost: set max_cost=$0.001 and run any real query. Verify circuit breaker fires after first turn.
6
For each triggered failure: verify the agent returns a useful partial_result, not a Python exception. Document the structured log output for each.
LAB 2

Structured Log Analysis

Objective: Practice querying structured logs to diagnose agent failures post-hoc.

1
Run your hardened agent on 20 different tasks. All logs go to a file agent.jsonl (one JSON object per line).
2
Write Python to parse agent.jsonl and compute: (a) total sessions, (b) success rate, (c) most called tools, (d) most common failure reason, (e) avg turns per successful session.
3
Find all sessions where a specific tool failed. Print: session_id, turn number, args passed, error message. This is the debugging workflow you'd use in production.
4
Identify the most expensive session. Reconstruct its full tool call sequence from the logs. What did it do that cost the most?
LAB 3

Checkpoint and Resume

Objective: Verify that checkpointing allows agent recovery from crashes without losing work.

1
Add checkpoint saving after every tool call in your agent. Use session_id as filename.
2
Start a long-running 10-turn task. After turn 5, forcefully kill the process (Ctrl+C or sys.exit()).
3
Restart the agent with the same session_id. Verify it loads from the checkpoint and continues from turn 6 — it should not redo turns 1-5.
4
Verify the final answer matches what you would have gotten without the interruption. Compare cost: checkpoint run should cost ~50% of a full restart.

P6-M21 MASTERY CHECKLIST

When complete: Move to P6-M22 — Evaluation Harnesses. You now have agents that fail safely. M22 covers how to measure and improve agent quality systematically with DeepEval, Ragas, and LLM-as-judge.

← P6-M20: Tool Design 🗺️ All Modules Next: P6-M22 — Evaluation Harnesses →