Part 6 — Agents, Workflows & Evaluation  ·  Module 20 of 22
Tool Design, Workflow Patterns & When NOT to Use Agents
Design reliable tools, pick the right workflow pattern, and know when simpler is better
⏱ 1 Week 🟠 Intermediate–Advanced 🔧 LangGraph · FastAPI · Anthropic 📋 Prerequisite: P6-M19
🎯

What This Module Covers

Design Thinking

Building agents that work in a notebook is easy. Building agents that work reliably in production is hard. This module covers the engineering judgment that separates toy agents from production ones: how to design tools that are reliable, how to choose the right workflow architecture, and critically — when a simple chain beats a complex agent every time.

  • Tool design principles — idempotency, error contracts, atomicity, what makes a good vs bad tool
  • The five workflow patterns — prompt chaining, routing, parallelisation, orchestrator-subagent, evaluator-optimizer
  • When NOT to use agents — the decision matrix that saves you from over-engineering
  • Parallel workflows — fan-out/fan-in patterns, when to parallelise, how to handle partial failures
  • Orchestrator-subagent — breaking complex goals into specialised sub-agents with handoff
🔧

What Makes a Good Tool

Design Principles

A tool is the interface between your agent and the real world. Bad tool design is the #1 source of agent failures — not the LLM, not the prompting.

# ── PRINCIPLE 1: Idempotent tools ────────────────────
# If the agent calls a tool twice with the same args, the result should be the same
# and no duplicate side effects should occur

# BAD: calling twice creates two records
def create_ticket(title: str, description: str) -> dict:
    return db.insert("tickets", {"title": title, "description": description})

# GOOD: upsert on a natural key — safe to call multiple times
def create_or_get_ticket(title: str, description: str) -> dict:
    existing = db.find_one("tickets", {"title": title})
    if existing:
        return existing
    return db.insert("tickets", {"title": title, "description": description})

# ── PRINCIPLE 2: Explicit error contracts ────────────
# Never raise exceptions — return structured errors the agent can understand

# BAD: agent receives an unhandled exception, gets confused
def get_user(user_id: str) -> dict:
    return db.get("users", user_id)   # raises KeyError if not found

# GOOD: structured error the agent can reason about
def get_user(user_id: str) -> dict:
    user = db.find_one("users", {"id": user_id})
    if not user:
        return {"error": "USER_NOT_FOUND",
                "message": f"No user with id '{user_id}'",
                "suggestion": "Try searching by email with search_users()"}
    return {"success": True, "user": user}

# ── PRINCIPLE 3: Atomic operations ───────────────────
# One tool should do ONE thing — not a chain of things

# BAD: one tool does too much — partial failures are unrecoverable
def process_order(order_id: str) -> dict:
    validate_stock()
    charge_payment()
    send_confirmation_email()
    update_inventory()

# GOOD: separate tools, agent orchestrates the sequence
def validate_stock(items: list) -> dict: ...
def charge_payment(amount: float, card_id: str) -> dict: ...
def send_confirmation_email(order_id: str, email: str) -> dict: ...
def update_inventory(items: list, delta: int) -> dict: ...
📝

Tool Description Engineering

Selection Precision
# The description determines WHEN the agent calls the tool.
# Bad descriptions → wrong tool selection → wrong results.

# ── Pattern: Use When / Don't Use When ───────────────
SEARCH_TOOL = {
    "name": "search_knowledge_base",
    "description": """Search the internal knowledge base for product documentation,
API references, and troubleshooting guides.

USE when:
- User asks about product features, configuration, or known issues
- User needs step-by-step instructions from documentation
- User references a specific version or release note

DO NOT USE when:
- Question is about general programming (use your training knowledge)
- Question requires real-time data (use get_live_status instead)
- Question is a math calculation (use calculate instead)""",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural language search query. Be specific. Example: 'how to configure DPDK hugepages on Linux'"
            },
            "version": {
                "type": "string",
                "description": "Optionally filter by product version, e.g. '23.11'. Omit for all versions."
            }
        },
        "required": ["query"]
    }
}

# ── Consistent return schema ──────────────────────────
# All tools should return a dict with consistent keys
# so the agent can reliably check for success/failure

def tool_success(data: dict, message: str = "") -> dict:
    return {"ok": True, "data": data, "message": message}

def tool_error(code: str, message: str, suggestion: str = "") -> dict:
    return {"ok": False, "error_code": code,
            "message": message, "suggestion": suggestion}

💡 Tool names are critical. search is ambiguous — the agent doesn't know what it searches. search_knowledge_base, search_web, search_customer_records are unambiguous. When you have multiple search tools, the names must make the distinction obvious without reading the description.

🔒

Tool Safety — Scope Limiting and Validation

Production
# Scope limiting — tools should only do what they say
import re

def query_database(sql: str, allowed_tables: list[str] | None = None) -> dict:
    # Validate it's a SELECT (never allow INSERT/DELETE/UPDATE from agent)
    if not sql.strip().upper().startswith("SELECT"):
        return tool_error("FORBIDDEN_OPERATION",
                          "Only SELECT queries are allowed",
                          "Use write_record() for data modification")

    # Validate only allowed tables are accessed (FROM and JOIN clauses).
    # Regex checks are a first line of defence — pair them with a
    # read-only database role so a malformed query still cannot write.
    if allowed_tables:
        tables_in_query = re.findall(r'(?:FROM|JOIN)\s+(\w+)', sql, re.IGNORECASE)
        for t in tables_in_query:
            if t not in allowed_tables:
                return tool_error("TABLE_NOT_ALLOWED",
                                  f"Table {t!r} not in allowed list: {allowed_tables}")
    try:
        results = db.execute(sql)
        return tool_success({"rows": results, "count": len(results)})
    except Exception as e:
        return tool_error("QUERY_ERROR", str(e))

# Rate limiting per tool to prevent runaway agents
from collections import defaultdict
import time

_tool_calls = defaultdict(list)
RATE_LIMITS = {"search_web": (10, 60)}   # 10 calls per 60 seconds

def rate_limit_check(tool_name: str) -> bool:
    if tool_name not in RATE_LIMITS:
        return True
    max_calls, window = RATE_LIMITS[tool_name]
    now = time.time()
    calls = [t for t in _tool_calls[tool_name] if now - t < window]
    _tool_calls[tool_name] = calls
    if len(calls) >= max_calls:
        return False
    _tool_calls[tool_name].append(now)
    return True
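One way to wire the limiter into tool dispatch, so the agent receives a structured error instead of a silent hang — a sketch only: `execute_tool`, the `TOOLS` registry, and the echo tool are illustrative, and the limiter is repeated so the example is self-contained (with a deliberately tight limit for the demo):

```python
import time
from collections import defaultdict

_tool_calls = defaultdict(list)
RATE_LIMITS = {"search_web": (3, 60)}   # 3 calls per 60 seconds (tight, for demo)

def rate_limit_check(tool_name: str) -> bool:
    if tool_name not in RATE_LIMITS:
        return True
    max_calls, window = RATE_LIMITS[tool_name]
    now = time.time()
    calls = [t for t in _tool_calls[tool_name] if now - t < window]
    _tool_calls[tool_name] = calls
    if len(calls) >= max_calls:
        return False
    _tool_calls[tool_name].append(now)
    return True

def tool_error(code: str, message: str, suggestion: str = "") -> dict:
    return {"ok": False, "error_code": code,
            "message": message, "suggestion": suggestion}

# Illustrative registry — a real one would hold your actual tool functions
TOOLS = {"search_web": lambda query: {"ok": True, "data": f"results for {query!r}"}}

def execute_tool(name: str, **kwargs) -> dict:
    # Check the limit BEFORE calling — the agent sees a structured error
    # it can reason about, not an exception
    if not rate_limit_check(name):
        return tool_error("RATE_LIMITED", f"{name} exceeded its rate limit",
                          "Wait before retrying, or reuse earlier results")
    return TOOLS[name](**kwargs)

outcomes = [execute_tool("search_web", query="dpdk")["ok"] for _ in range(4)]
print(outcomes)   # first 3 succeed, 4th is rate-limited
```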
🗺

The Five Workflow Patterns

Architecture Toolkit

These five patterns cover 90% of real AI system architectures. Knowing them prevents you from reaching for a full agent when a simpler pattern will do.

Prompt Chaining

LLM output of step N feeds as input to step N+1. Each step does one thing well.

Use: linear multi-step tasks, document pipelines

Routing

A classifier LLM routes input to one of several specialised handlers. Each handler is optimised for its class.

Use: multi-category support, mixed content types

Parallelisation

Multiple LLM calls run concurrently on the same input. Results are aggregated (voting or merge).

Use: independent subtasks, multi-perspective analysis

Orchestrator-Subagent

A planning LLM breaks the task into subtasks and dispatches to specialised subagents. Results are synthesised.

Use: complex multi-domain tasks, large research jobs

Evaluator-Optimizer

One LLM generates output, another evaluates quality and provides feedback for improvement. Loops until quality threshold met.

Use: code generation, content quality requirements
🔗

Prompt Chaining — Implementation

Most Common
# Prompt chaining: clean, testable, each step independently improvable
# Each gate() call validates before passing to the next step

def chain_extract_summarise_translate(document: str, target_lang: str) -> dict:
    # Step 1: Extract key facts
    facts = call_llm(
        system="Extract the 5 most important factual claims from this document. Output as a numbered list.",
        user=document
    )
    if not facts:
        return {"error": "Extraction failed"}

    # Step 2: Summarise the facts
    summary = call_llm(
        system="Write a 2-3 sentence executive summary based on these key facts.",
        user=facts
    )

    # Step 3: Translate (only if not English)
    if target_lang.lower() not in ("en", "english"):
        translated = call_llm(
            system=f"Translate to {target_lang}. Maintain tone and technical terms.",
            user=summary
        )
    else:
        translated = summary

    return {"facts": facts, "summary": summary, "translated": translated}

# Evaluator-Optimizer pattern
import json

def generate_with_quality_loop(prompt: str, max_iterations: int = 3) -> str:
    output = call_llm(system="Generate a response.", user=prompt)

    for i in range(max_iterations):
        evaluation = call_llm(
            system="""Evaluate this output for: accuracy, completeness, clarity.
Return ONLY JSON: {"score": 1-10, "issues": [...], "passed": bool}""",
            user=f"Original prompt: {prompt}\n\nOutput: {output}"
        )
        try:
            result = json.loads(evaluation)
        except json.JSONDecodeError:
            break   # malformed evaluator output — keep the current draft
        if result.get("passed") or result.get("score", 0) >= 8:
            break

        # Regenerate with feedback
        output = call_llm(
            system="Improve the output based on this feedback.",
            user=f"Previous output: {output}\n\nIssues: {result['issues']}"
        )
    return output
🔀

Routing Pattern — LLM as Classifier

Scalable
from typing import Literal
from pydantic import BaseModel
import instructor, anthropic

instr_client = instructor.from_anthropic(anthropic.Anthropic())

class RouteDecision(BaseModel):
    category: Literal["billing", "technical", "general", "complaint"]
    confidence: float
    reasoning: str

def route_support_ticket(ticket: str) -> RouteDecision:
    return instr_client.messages.create(
        model="claude-3-haiku-20240307",   # cheap model for routing
        max_tokens=100,
        messages=[{"role": "user",
                   "content": f"Classify this support ticket:\n\n{ticket}"}],
        response_model=RouteDecision
    )

# Specialised handlers — each optimised for its category
HANDLERS = {
    "billing":   handle_billing_ticket,
    "technical": handle_technical_ticket,
    "general":   handle_general_ticket,
    "complaint": handle_complaint_ticket,
}

def process_ticket(ticket: str) -> dict:
    route = route_support_ticket(ticket)
    handler = HANDLERS[route.category]
    return handler(ticket)
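The `RouteDecision` model returns a confidence score — use it. When the classifier is unsure, falling back to a safe default handler is usually better than trusting a low-confidence label. A small sketch; the `0.7` threshold and `"general"` fallback are assumptions for illustration, not part of the routing code above:

```python
# Defensive routing: below a confidence threshold, fall back to the
# default handler instead of trusting the classifier's guess.
def pick_handler(category: str, confidence: float,
                 threshold: float = 0.7, fallback: str = "general") -> str:
    return category if confidence >= threshold else fallback

print(pick_handler("billing", 0.95))    # → billing
print(pick_handler("complaint", 0.40))  # → general (low confidence)
```

In `process_ticket`, this would replace the direct `HANDLERS[route.category]` lookup with `HANDLERS[pick_handler(route.category, route.confidence)]`.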
🚫

When NOT to Use Agents — The Decision Matrix

Most Important Lesson

The most common mistake in AI engineering is reaching for agents when a simpler architecture would be more reliable, cheaper, and faster to debug. Agents introduce non-determinism — every additional LLM decision is a point of potential failure.

Situation                                        Use Agent?   Better Alternative
Steps are always the same                        ✗ No         Prompt chain — deterministic, testable
Steps depend on content classification           ✗ No         Routing — LLM classifier + fixed handlers
Independent subtasks on same input               ✗ No         Parallelisation — asyncio.gather()
Single API call answers the question             ✗ No         Simple function call or RAG query
Steps are known, but order varies by input       ✗ No         Routing with multiple fixed chains
Task requires dynamic tool selection             ✓ Yes        —
Number of steps not known in advance             ✓ Yes        —
Task requires reasoning about partial results    ✓ Yes        —
Task spans multiple API/DB systems dynamically   ✓ Yes        —
# The "do I need an agent?" test — ask these questions in order:
#
# 1. Can I write the steps as a fixed Python function?
#    YES → use a chain or function call. NOT an agent.
#
# 2. Do the steps vary, but can I enumerate all the variations?
#    YES → use routing. NOT an agent.
#
# 3. Are the subtasks independent and can run in parallel?
#    YES → use asyncio.gather(). NOT an agent.
#
# 4. Is the sequence truly unpredictable until you see the data?
#    YES → now consider an agent.
#
# If you reach question 4 — also ask:
# - Can I tolerate non-determinism in production?
# - Do I have evaluation/monitoring to catch failures?
# - Is the latency and cost of multi-turn LLM reasoning acceptable?
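The four questions above can be encoded directly — the boolean inputs are things you answer about your task by hand; this sketch just makes the order of the questions explicit:

```python
# The "do I need an agent?" test as code. Answer each question about
# your task, and the first "yes" wins.
def recommend_architecture(fixed_steps: bool,
                           enumerable_variations: bool,
                           independent_subtasks: bool) -> str:
    if fixed_steps:
        return "chain"        # Q1: steps writable as a fixed function
    if enumerable_variations:
        return "routing"      # Q2: variations can all be enumerated
    if independent_subtasks:
        return "parallel"     # Q3: subtasks are independent
    return "agent"            # Q4: sequence truly data-dependent

print(recommend_architecture(True, False, False))    # → chain
print(recommend_architecture(False, True, False))    # → routing
print(recommend_architecture(False, False, False))   # → agent
```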

⚠️ Agents are harder to test, harder to debug, more expensive, and slower than deterministic pipelines. Every additional LLM call is a potential point of failure, cost, and latency. Anthropic's own guidelines say: augment agents with workflows wherever possible, and only add true autonomy where it is genuinely necessary.

Parallel Workflows — Fan-Out / Fan-In

Performance Pattern
import asyncio, anthropic

async_client = anthropic.AsyncAnthropic()

async def call_llm_async(system: str, user: str) -> str:
    response = await async_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": user}],
        system=system
    )
    return response.content[0].text

# ── Pattern 1: Same input, multiple perspectives ──────
async def multi_perspective_review(code: str) -> dict:
    security, performance, readability = await asyncio.gather(
        call_llm_async("Review this code for security vulnerabilities only.", code),
        call_llm_async("Review this code for performance issues only.", code),
        call_llm_async("Review this code for readability and maintainability only.", code),
    )
    # Synthesise all three perspectives
    synthesis = await call_llm_async(
        "Combine these three code reviews into a single prioritised action list.",
        f"Security:\n{security}\n\nPerformance:\n{performance}\n\nReadability:\n{readability}"
    )
    return {"security": security, "performance": performance,
            "readability": readability, "synthesis": synthesis}

# ── Pattern 2: Different inputs, same processing ──────
async def process_documents_parallel(documents: list[str]) -> list[str]:
    summaries = await asyncio.gather(
        *[call_llm_async("Summarise in 2 sentences.", doc) for doc in documents]
    )
    return list(summaries)

# ── Pattern 3: Voting — run N times, take majority ────
async def classify_with_voting(text: str, n: int = 3) -> str:
    from collections import Counter
    labels = await asyncio.gather(
        *[call_llm_async(
            "Classify as POSITIVE, NEGATIVE, or NEUTRAL. Reply with one word only.", text
          ) for _ in range(n)]
    )
    labels = [l.strip().upper() for l in labels]
    return Counter(labels).most_common(1)[0][0]

# ── Handling partial failures ──────────────────────────
async def gather_with_fallback(coroutines: list) -> list:
    results = await asyncio.gather(*coroutines, return_exceptions=True)
    processed = []
    for r in results:
        if isinstance(r, Exception):
            processed.append({"error": str(r)})
        else:
            processed.append(r)
    return processed
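The partial-failure pattern can be exercised without any LLM calls — useful for unit tests. The stub coroutines below are illustrative stand-ins for the review calls:

```python
# Exercising gather(return_exceptions=True) with stubs: one "review"
# fails, the other two still succeed, and the failure surfaces as a
# structured error dict rather than crashing the whole fan-out.
import asyncio

async def ok(value: str) -> str:
    return value

async def boom() -> str:
    raise RuntimeError("simulated review failure")

async def gather_with_fallback(coroutines: list) -> list:
    results = await asyncio.gather(*coroutines, return_exceptions=True)
    return [{"error": str(r)} if isinstance(r, Exception) else r
            for r in results]

results = asyncio.run(
    gather_with_fallback([ok("security"), boom(), ok("readability")])
)
print(results)   # → ['security', {'error': 'simulated review failure'}, 'readability']
```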
🎯

Orchestrator-Subagent Pattern

Complex Tasks

For complex tasks that span multiple domains (research + analysis + writing), an orchestrator LLM plans and dispatches to specialised subagents. Each subagent has its own tools and system prompt optimised for its domain.

from pydantic import BaseModel
from typing import List, Literal
import instructor, anthropic, asyncio

instr_client = instructor.from_anthropic(anthropic.Anthropic())

class SubTask(BaseModel):
    agent:       Literal["researcher", "analyst", "writer"]
    task:        str
    depends_on:  List[int] = []   # indices of tasks that must complete first

class OrchestratorPlan(BaseModel):
    goal_summary: str
    subtasks:    List[SubTask]

def orchestrate(user_goal: str) -> str:
    # 1. Orchestrator plans the work
    plan = instr_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content":
            f"""Break this goal into subtasks for specialised agents:
Goal: {user_goal}

Available agents:
- researcher: searches web, finds facts, gathers data
- analyst: processes data, identifies patterns, creates structured analysis
- writer: synthesises research and analysis into coherent written output"""}],
        response_model=OrchestratorPlan
    )

    results = {}

    # 2. Execute subtasks in plan order — this assumes the orchestrator
    #    lists every task after the tasks it depends on
    for i, subtask in enumerate(plan.subtasks):
        # Gather results from completed dependencies as context
        context = "\n\n".join(
            f"Result from task {j}: {results[j]}"
            for j in subtask.depends_on if j in results
        )

        # Dispatch to specialised subagent
        AGENT_SYSTEMS = {
            "researcher": "You are a researcher. Find accurate information. Cite sources.",
            "analyst":   "You are an analyst. Process data systematically. Be precise.",
            "writer":    "You are a technical writer. Write clearly for the target audience.",
        }
        task_with_context = subtask.task
        if context:
            task_with_context = f"Prior results:\n{context}\n\nYour task: {subtask.task}"

        results[i] = run_agent(
            user_message=task_with_context,
            system=AGENT_SYSTEMS[subtask.agent]
        )

    # 3. Final synthesis
    all_results = "\n\n".join(f"Task {i}: {r}" for i, r in results.items())
    return run_agent(
        user_message=f"Goal: {user_goal}\n\nAll subtask results:\n{all_results}\n\nSynthesize into a complete answer.",
        system="You are a senior analyst. Synthesise all results into a coherent, complete response."
    )
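The execution loop trusts the orchestrator to list each subtask after its dependencies. LLM-generated plans sometimes violate that, so a cheap structural check before spending any tokens is worth having. A sketch using plain dicts as stand-ins for the `SubTask` objects:

```python
# Guard against out-of-order plans: every depends_on index must point
# at an EARLIER subtask, or the dependent task would run with missing
# context. Dicts stand in for SubTask models here.
def plan_is_ordered(subtasks: list[dict]) -> bool:
    for i, st in enumerate(subtasks):
        if any(dep >= i for dep in st.get("depends_on", [])):
            return False
    return True

good = [{"depends_on": []}, {"depends_on": [0]}, {"depends_on": [0, 1]}]
bad  = [{"depends_on": [1]}, {"depends_on": []}]   # task 0 depends on task 1

print(plan_is_ordered(good))  # → True
print(plan_is_ordered(bad))   # → False
```

In `orchestrate`, a failed check could trigger one re-plan request before falling back to an error.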

FREE LEARNING RESOURCES

  • Article — Anthropic: Building Effective Agents (anthropic.com/research) — the definitive guide on workflow patterns, when to use agents, and how to design reliable systems. Required reading.
  • Docs — LangGraph: Multi-Agent Systems (langchain-ai.github.io/langgraph) — supervisor patterns, handoff protocols, and shared memory between agents.
  • Article — OpenAI: A Practical Guide to Building Agents (cdn.openai.com) — OpenAI's agent patterns including orchestrator-subagent and guardrail design.
🛠 Multi-Pattern Pipeline — Same Task, Three Architectures [Advanced] 3–4 days

Build the same complex task using three different architectures and compare reliability, cost, and latency. This is the exercise that builds real engineering judgment.

Task: Competitive Intelligence Report

Given a company name, produce a structured report: executive summary, products/services, market position, recent news, SWOT analysis.

Architecture 1 — Prompt Chain

  • 5 fixed sequential LLM calls, each producing one section
  • Each step's output feeds the next as context

Architecture 2 — Parallel + Synthesis

  • 4 parallel calls (exec summary, products, market, news)
  • 1 final synthesis call combining all results

Architecture 3 — Orchestrator-Subagent

  • Orchestrator plans and dispatches to researcher + analyst + writer subagents
  • Each subagent has its own tools and system prompt

Evaluation

  • Run all three on the same 3 companies. Measure: total latency, total tokens, total cost, output quality (manual 1-5 rating)
  • Document: which architecture would you ship and why?

Skills: Prompt chaining, asyncio.gather, orchestrator-subagent, cost/latency measurement, architecture trade-off analysis

LAB 1

Tool Design Audit — Fix Three Bad Tools

Objective: Apply the tool design principles to real tool definitions and measure the improvement.

1
Write three intentionally bad tools: (a) a do_stuff(input) with vague name and description, (b) a tool that raises an exception on error instead of returning a dict, (c) a tool that does 3 things (fetch + process + save) in one call.
2
Connect these to an agent. Run 5 queries that should trigger these tools. Record how often the agent: selects the wrong tool, crashes on the exception, or produces inconsistent results from the multi-purpose tool.
3
Fix each tool: rename with specific verb+noun, return structured error dicts, split into atomic operations. Rerun the same 5 queries. Compare failure rates.
4
Add the USE/DON'T USE pattern to each tool description. Test with ambiguous queries that could trigger multiple tools — does selection improve?
LAB 2

Pattern Selection — Choose the Right Architecture

Objective: Practice the decision matrix by correctly categorising 10 real tasks.

1
For each of the following tasks, apply the decision matrix and determine the right pattern (chain, routing, parallel, agent, orchestrator): (a) translate a document to 5 languages, (b) answer a customer support email (billing/technical/general), (c) generate a test suite for a function, (d) research and write a 10-page market analysis, (e) summarise a meeting transcript into action items.
2
Implement two of the non-agent solutions (chain or parallel). Measure latency and cost vs a naive "just use an agent" implementation for the same tasks.
3
Document: For which tasks was the simpler architecture actually better? What would have gone wrong with the agent approach?
LAB 3

Parallel Fan-Out — Measure Real Speedup

Objective: Quantify the latency benefit of parallelisation on a real multi-perspective task.

1
Take a 500-word technical document. Build a sequential pipeline: 4 sequential LLM calls for security, performance, readability, and documentation reviews. Time the total.
2
Build the parallel version using asyncio.gather for the same 4 reviews. Time the total.
3
Add a 5th synthesis step (sequential in both versions). Compare: total time, total tokens, quality of synthesis output.
4
Test partial failure handling: make one of the 4 review calls intentionally fail. Does gather(return_exceptions=True) allow the other 3 to succeed? Does the synthesis handle the missing review gracefully?

P6-M20 MASTERY CHECKLIST

When complete: Move to P6-M21 — Failure Handling in Agents. You now know how to design good agents. M21 covers what to do when they go wrong — which they will, at scale.

← P6-M19: Agent Loops 🗺️ All Modules Next: P6-M21 — Failure Handling →