What This Module Covers
Design Thinking
Building agents that work in a notebook is easy. Building agents that work reliably in production is hard. This module covers the engineering judgment that separates toy agents from production ones: how to design tools that are reliable, how to choose the right workflow architecture, and critically — when a simple chain beats a complex agent every time.
- Tool design principles — idempotency, error contracts, atomicity, what makes a good vs bad tool
- The five workflow patterns — prompt chaining, routing, parallelisation, orchestrator-subagent, evaluator-optimizer
- When NOT to use agents — the decision matrix that saves you from over-engineering
- Parallel workflows — fan-out/fan-in patterns, when to parallelise, how to handle partial failures
- Orchestrator-subagent — breaking complex goals into specialised sub-agents with handoff
What Makes a Good Tool
Design Principles
A tool is the interface between your agent and the real world. Bad tool design is the #1 source of agent failures — not the LLM, not the prompting.
# ── PRINCIPLE 1: Idempotent tools ────────────────────
# If the agent calls a tool twice with the same args, the result should be
# the same and no duplicate side effects should occur

# BAD: calling twice creates two records
def create_ticket(title: str, description: str) -> dict:
    return db.insert("tickets", {"title": title, "description": description})

# GOOD: upsert on a natural key — safe to call multiple times
def create_or_get_ticket(title: str, description: str) -> dict:
    existing = db.find_one("tickets", {"title": title})
    if existing:
        return existing
    return db.insert("tickets", {"title": title, "description": description})

# ── PRINCIPLE 2: Explicit error contracts ────────────
# Never raise exceptions — return structured errors the agent can understand

# BAD: agent receives an unhandled exception, gets confused
def get_user(user_id: str) -> dict:
    return db.get("users", user_id)  # raises KeyError if not found

# GOOD: structured error the agent can reason about
def get_user(user_id: str) -> dict:
    user = db.find_one("users", {"id": user_id})
    if not user:
        return {"error": "USER_NOT_FOUND",
                "message": f"No user with id '{user_id}'",
                "suggestion": "Try searching by email with search_users()"}
    return {"success": True, "user": user}

# ── PRINCIPLE 3: Atomic operations ───────────────────
# One tool should do ONE thing — not a chain of things

# BAD: one tool does too much — partial failures are unrecoverable
def process_order(order_id: str) -> dict:
    validate_stock()
    charge_payment()
    send_confirmation_email()
    update_inventory()

# GOOD: separate tools, agent orchestrates the sequence
def validate_stock(items: list) -> dict: ...
def charge_payment(amount: float, card_id: str) -> dict: ...
def send_confirmation_email(order_id: str, email: str) -> dict: ...
def update_inventory(items: list, delta: int) -> dict: ...
Tool Description Engineering
Selection Precision
# The description determines WHEN the agent calls the tool.
# Bad descriptions → wrong tool selection → wrong results.

# ── Pattern: Use When / Don't Use When ───────────────
SEARCH_TOOL = {
    "name": "search_knowledge_base",
    "description": """Search the internal knowledge base for product documentation,
API references, and troubleshooting guides.

USE when:
- User asks about product features, configuration, or known issues
- User needs step-by-step instructions from documentation
- User references a specific version or release note

DO NOT USE when:
- Question is about general programming (use your training knowledge)
- Question requires real-time data (use get_live_status instead)
- Question is a math calculation (use calculate instead)""",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural language search query. Be specific. Example: 'how to configure DPDK hugepages on Linux'"
            },
            "version": {
                "type": "string",
                "description": "Optionally filter by product version, e.g. '23.11'. Omit for all versions."
            }
        },
        "required": ["query"]
    }
}

# ── Consistent return schema ──────────────────────────
# All tools should return a dict with consistent keys
# so the agent can reliably check for success/failure
def tool_success(data: dict, message: str = "") -> dict:
    return {"ok": True, "data": data, "message": message}

def tool_error(code: str, message: str, suggestion: str = "") -> dict:
    return {"ok": False, "error_code": code, "message": message,
            "suggestion": suggestion}
💡 Tool names are critical. search is ambiguous — the agent doesn't know what it searches. search_knowledge_base, search_web, search_customer_records are unambiguous. When you have multiple search tools, the names must make the distinction obvious without reading the description.
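For instance, a tool registry makes the point concrete. This is a minimal sketch: the three search functions are hypothetical stubs, and only the naming pattern matters.

# Hypothetical stubs: each would wrap a real backend
def search_knowledge_base(query: str) -> dict: ...   # internal docs
def search_web(query: str) -> dict: ...              # live internet results
def search_customer_records(query: str) -> dict: ... # CRM data

# GOOD: the registry names alone tell the agent what each tool searches;
# a bare "search" would force it to guess
TOOLS = {fn.__name__: fn for fn in
         (search_knowledge_base, search_web, search_customer_records)}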
Tool Safety — Scope Limiting and Validation
Production
# Scope limiting — tools should only do what they say
import re

def query_database(sql: str, allowed_tables: list[str] | None = None) -> dict:
    # Validate it's a SELECT (never allow INSERT/DELETE/UPDATE from the agent)
    if not sql.strip().upper().startswith("SELECT"):
        return tool_error("FORBIDDEN_OPERATION",
                          "Only SELECT queries are allowed",
                          "Use write_record() for data modification")
    # Validate only allowed tables are accessed
    if allowed_tables:
        tables_in_query = re.findall(r'FROM\s+(\w+)', sql, re.IGNORECASE)
        for t in tables_in_query:
            if t not in allowed_tables:
                return tool_error("TABLE_NOT_ALLOWED",
                                  f"Table {t!r} not in allowed list: {allowed_tables}")
    try:
        results = db.execute(sql)
        return tool_success({"rows": results, "count": len(results)})
    except Exception as e:
        return tool_error("QUERY_ERROR", str(e))

# Rate limiting per tool to prevent runaway agents
from collections import defaultdict
import time

_tool_calls = defaultdict(list)
RATE_LIMITS = {"search_web": (10, 60)}  # 10 calls per 60 seconds

def rate_limit_check(tool_name: str) -> bool:
    if tool_name not in RATE_LIMITS:
        return True
    max_calls, window = RATE_LIMITS[tool_name]
    now = time.time()
    calls = [t for t in _tool_calls[tool_name] if now - t < window]
    _tool_calls[tool_name] = calls
    if len(calls) >= max_calls:
        return False
    _tool_calls[tool_name].append(now)
    return True
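To show where rate_limit_check sits in practice, here is a minimal dispatch wrapper. It is a sketch: dispatch_tool and the TOOLS registry are assumptions, not part of the code above.

def dispatch_tool(tool_name: str, args: dict) -> dict:
    # Check the per-tool rate limit BEFORE executing the tool
    if not rate_limit_check(tool_name):
        return tool_error("RATE_LIMITED",
                          f"Too many calls to {tool_name!r} in the current window",
                          "Wait and retry, or reuse an earlier result")
    tool_fn = TOOLS.get(tool_name)  # TOOLS: name → function registry (assumed)
    if tool_fn is None:
        return tool_error("UNKNOWN_TOOL", f"No tool named {tool_name!r}")
    return tool_fn(**args)

Returning a structured RATE_LIMITED error keeps Principle 2 intact: the agent can reason about the failure instead of receiving an exception.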
The Five Workflow Patterns
Architecture Toolkit
These five patterns cover the vast majority of real AI system architectures. Knowing them prevents you from reaching for a full agent when a simpler pattern will do.
Prompt Chaining
LLM output of step N feeds as input to step N+1. Each step does one thing well.
Routing
A classifier LLM routes input to one of several specialised handlers. Each handler is optimised for its class.
Parallelisation
Multiple LLM calls run concurrently on the same input. Results are aggregated (voting or merge).
Orchestrator-Subagent
A planning LLM breaks the task into subtasks and dispatches to specialised subagents. Results are synthesised.
Evaluator-Optimizer
One LLM generates output, another evaluates quality and provides feedback for improvement. Loops until quality threshold met.
Prompt Chaining — Implementation
Most Common
# Prompt chaining: clean, testable, each step independently improvable
# A gate after each step validates output before passing it to the next
# call_llm(system, user) -> str is assumed: a thin wrapper over the chat API
import json

def chain_extract_summarise_translate(document: str, target_lang: str) -> dict:
    # Step 1: Extract key facts
    facts = call_llm(
        system="Extract the 5 most important factual claims from this document. Output as a numbered list.",
        user=document
    )
    if not facts:
        return {"error": "Extraction failed"}

    # Step 2: Summarise the facts
    summary = call_llm(
        system="Write a 2-3 sentence executive summary based on these key facts.",
        user=facts
    )

    # Step 3: Translate (only if not English)
    if target_lang.lower() not in ("en", "english"):
        translated = call_llm(
            system=f"Translate to {target_lang}. Maintain tone and technical terms.",
            user=summary
        )
    else:
        translated = summary

    return {"facts": facts, "summary": summary, "translated": translated}

# Evaluator-Optimizer pattern
def generate_with_quality_loop(prompt: str, max_iterations: int = 3) -> str:
    output = call_llm(system="Generate a response.", user=prompt)
    for _ in range(max_iterations):
        evaluation = call_llm(
            system="""Evaluate this output for: accuracy, completeness, clarity.
Return JSON: {"score": 1-10, "issues": [...], "passed": bool}""",
            user=f"Original prompt: {prompt}\n\nOutput: {output}"
        )
        result = json.loads(evaluation)
        if result.get("passed") or result.get("score", 0) >= 8:
            break
        # Regenerate with feedback
        output = call_llm(
            system="Improve the output based on this feedback.",
            user=f"Previous output: {output}\n\nIssues: {result['issues']}"
        )
    return output
Routing Pattern — LLM as Classifier
Scalable
from typing import Literal
from pydantic import BaseModel
import instructor, anthropic

instr_client = instructor.from_anthropic(anthropic.Anthropic())

class RouteDecision(BaseModel):
    category: Literal["billing", "technical", "general", "complaint"]
    confidence: float
    reasoning: str

def route_support_ticket(ticket: str) -> RouteDecision:
    return instr_client.messages.create(
        model="claude-3-haiku-20240307",  # cheap model for routing
        max_tokens=100,
        messages=[{"role": "user",
                   "content": f"Classify this support ticket:\n\n{ticket}"}],
        response_model=RouteDecision
    )

# Specialised handlers — each optimised for its category
HANDLERS = {
    "billing": handle_billing_ticket,
    "technical": handle_technical_ticket,
    "general": handle_general_ticket,
    "complaint": handle_complaint_ticket,
}

def process_ticket(ticket: str) -> dict:
    route = route_support_ticket(ticket)
    handler = HANDLERS[route.category]
    return handler(ticket)
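RouteDecision includes a confidence field that process_ticket never inspects. One way to use it, as a sketch (the 0.7 threshold and the general-handler fallback are assumptions):

def process_ticket_with_fallback(ticket: str, threshold: float = 0.7) -> dict:
    route = route_support_ticket(ticket)
    # Low-confidence classifications go to the general handler
    # instead of a possibly wrong specialist
    if route.confidence < threshold:
        return handle_general_ticket(ticket)
    return HANDLERS[route.category](ticket)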
When NOT to Use Agents — The Decision Matrix
Most Important Lesson
The most common mistake in AI engineering is reaching for agents when a simpler architecture would be more reliable, cheaper, and faster to debug. Agents introduce non-determinism — every additional LLM decision is a point of potential failure.
| Situation | Use Agent? | Better Alternative |
|---|---|---|
| Steps are always the same | ✗ No | Prompt chain — deterministic, testable |
| Steps depend on content classification | ✗ No | Routing — LLM classifier + fixed handlers |
| Independent subtasks on same input | ✗ No | Parallelisation — asyncio.gather() |
| Single API call answers the question | ✗ No | Simple function call or RAG query |
| Steps are known, but order varies by input | ✗ No | Routing with multiple fixed chains |
| Task requires dynamic tool selection | ✓ Yes | — |
| Number of steps not known in advance | ✓ Yes | — |
| Task requires reasoning about partial results | ✓ Yes | — |
| Task spans multiple API/DB systems dynamically | ✓ Yes | — |
# The "do I need an agent?" test — ask these questions in order: # # 1. Can I write the steps as a fixed Python function? # YES → use a chain or function call. NOT an agent. # # 2. Do the steps vary, but can I enumerate all the variations? # YES → use routing. NOT an agent. # # 3. Are the subtasks independent and can run in parallel? # YES → use asyncio.gather(). NOT an agent. # # 4. Is the sequence truly unpredictable until you see the data? # YES → now consider an agent. # # If you reach question 4 — also ask: # - Can I tolerate non-determinism in production? # - Do I have evaluation/monitoring to catch failures? # - Is the latency and cost of multi-turn LLM reasoning acceptable?
⚠️ Agents are harder to test, harder to debug, more expensive, and slower than deterministic pipelines. Every additional LLM call is a potential point of failure, cost, and latency. Anthropic's own guidance is to prefer simple workflows wherever possible and to add true autonomy only where it is genuinely necessary.
Parallel Workflows — Fan-Out / Fan-In
Performance Pattern
import asyncio, anthropic

async_client = anthropic.AsyncAnthropic()

async def call_llm_async(system: str, user: str) -> str:
    response = await async_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": user}],
        system=system
    )
    return response.content[0].text

# ── Pattern 1: Same input, multiple perspectives ──────
async def multi_perspective_review(code: str) -> dict:
    security, performance, readability = await asyncio.gather(
        call_llm_async("Review this code for security vulnerabilities only.", code),
        call_llm_async("Review this code for performance issues only.", code),
        call_llm_async("Review this code for readability and maintainability only.", code),
    )
    # Synthesise all three perspectives
    synthesis = await call_llm_async(
        "Combine these three code reviews into a single prioritised action list.",
        f"Security:\n{security}\n\nPerformance:\n{performance}\n\nReadability:\n{readability}"
    )
    return {"security": security, "performance": performance,
            "readability": readability, "synthesis": synthesis}

# ── Pattern 2: Different inputs, same processing ──────
async def process_documents_parallel(documents: list[str]) -> list[str]:
    summaries = await asyncio.gather(
        *[call_llm_async("Summarise in 2 sentences.", doc) for doc in documents]
    )
    return list(summaries)
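# ── Added sketch: bounded fan-out (assumption: large batches) ──
# Unbounded gather() over hundreds of documents can trip provider rate
# limits; a semaphore caps how many requests are in flight at once.
# The max_concurrent=5 default is an assumption, not a recommendation.
async def process_documents_bounded(documents: list[str],
                                    max_concurrent: int = 5) -> list[str]:
    sem = asyncio.Semaphore(max_concurrent)

    async def summarise_one(doc: str) -> str:
        async with sem:  # at most max_concurrent calls run concurrently
            return await call_llm_async("Summarise in 2 sentences.", doc)

    return list(await asyncio.gather(*[summarise_one(d) for d in documents]))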
# ── Pattern 3: Voting — run N times, take majority ────
from collections import Counter

async def classify_with_voting(text: str, n: int = 3) -> str:
    labels = await asyncio.gather(
        *[call_llm_async(
            "Classify as POSITIVE, NEGATIVE, or NEUTRAL. Reply with one word only.", text
        ) for _ in range(n)]
    )
    labels = [l.strip().upper() for l in labels]
    return Counter(labels).most_common(1)[0][0]

# ── Handling partial failures ──────────────────────────
async def gather_with_fallback(coroutines: list) -> list:
    results = await asyncio.gather(*coroutines, return_exceptions=True)
    processed = []
    for r in results:
        if isinstance(r, Exception):
            processed.append({"error": str(r)})
        else:
            processed.append(r)
    return processed
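A usage sketch for gather_with_fallback (the caller below is hypothetical): failed calls surface as {"error": ...} dicts instead of crashing the whole batch.

async def summarise_batch(documents: list[str]) -> list:
    results = await gather_with_fallback(
        [call_llm_async("Summarise in 2 sentences.", doc) for doc in documents]
    )
    ok = [r for r in results if not (isinstance(r, dict) and "error" in r)]
    failed = len(results) - len(ok)
    print(f"{len(ok)} succeeded, {failed} failed")  # retry or log the failures
    return ok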
Orchestrator-Subagent Pattern
Complex Tasks
For complex tasks that span multiple domains (research + analysis + writing), an orchestrator LLM plans and dispatches to specialised subagents. Each subagent has its own tools and system prompt optimised for its domain.
from pydantic import BaseModel
from typing import List, Literal
import instructor, anthropic

instr_client = instructor.from_anthropic(anthropic.Anthropic())

class SubTask(BaseModel):
    agent: Literal["researcher", "analyst", "writer"]
    task: str
    depends_on: List[int] = []  # indices of tasks that must complete first

class OrchestratorPlan(BaseModel):
    goal_summary: str
    subtasks: List[SubTask]

# System prompts for the specialised subagents
AGENT_SYSTEMS = {
    "researcher": "You are a researcher. Find accurate information. Cite sources.",
    "analyst": "You are an analyst. Process data systematically. Be precise.",
    "writer": "You are a technical writer. Write clearly for the target audience.",
}

def orchestrate(user_goal: str) -> str:
    # 1. Orchestrator plans the work
    plan = instr_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content":
            f"""Break this goal into subtasks for specialised agents:

Goal: {user_goal}

Available agents:
- researcher: searches web, finds facts, gathers data
- analyst: processes data, identifies patterns, creates structured analysis
- writer: synthesises research and analysis into coherent written output"""}],
        response_model=OrchestratorPlan
    )

    results = {}
    # 2. Execute subtasks in plan order, passing completed dependencies as context
    for i, subtask in enumerate(plan.subtasks):
        context = "\n\n".join(
            f"Result from task {j}: {results[j]}"
            for j in subtask.depends_on if j in results
        )
        task_with_context = subtask.task
        if context:
            task_with_context = f"Prior results:\n{context}\n\nYour task: {subtask.task}"
        # Dispatch to the specialised subagent
        # (assumes run_agent(), a tool-calling agent loop defined elsewhere)
        results[i] = run_agent(
            user_message=task_with_context,
            system=AGENT_SYSTEMS[subtask.agent]
        )

    # 3. Final synthesis
    all_results = "\n\n".join(f"Task {i}: {r}" for i, r in results.items())
    return run_agent(
        user_message=f"Goal: {user_goal}\n\nAll subtask results:\n{all_results}\n\nSynthesize into a complete answer.",
        system="You are a senior analyst. Synthesise all results into a coherent, complete response."
    )
FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Article | Anthropic: Building Effective Agents — anthropic.com/research | The definitive guide on workflow patterns, when to use agents, and how to design reliable systems. Required reading. |
| Docs | LangGraph: Multi-Agent Systems — langchain-ai.github.io/langgraph | Supervisor patterns, handoff protocols, and shared memory between agents. |
| Article | OpenAI: A Practical Guide to Building Agents — cdn.openai.com | OpenAI's agent patterns including orchestrator-subagent and guardrail design. |
MILESTONE PROJECT
Build the same complex task using three different architectures and compare reliability, cost, and latency. This is the exercise that builds real engineering judgment.
Task: Competitive Intelligence Report
Given a company name, produce a structured report: executive summary, products/services, market position, recent news, SWOT analysis.
Architecture 1 — Prompt Chain
- 5 fixed sequential LLM calls, each producing one section
- Each step's output feeds the next as context
Architecture 2 — Parallel + Synthesis
- 4 parallel calls (exec summary, products, market, news)
- 1 final synthesis call combining all results
Architecture 3 — Orchestrator-Subagent
- Orchestrator plans and dispatches to researcher + analyst + writer subagents
- Each subagent has its own tools and system prompt
Evaluation
- Run all three on the same 3 companies. Measure: total latency, total tokens, total cost, output quality (manual 1-5 rating)
- Document: which architecture would you ship and why?
Skills: Prompt chaining, asyncio.gather, orchestrator-subagent, cost/latency measurement, architecture trade-off analysis
Tool Design Audit — Fix Three Bad Tools
Objective: Apply the tool design principles to real tool definitions and measure the improvement.
You are given three bad tools: (a) do_stuff(input) with a vague name and description, (b) a tool that raises an exception on error instead of returning a dict, (c) a tool that does 3 things (fetch + process + save) in one call.
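A sketch of what the three bad tools might look like as code (hypothetical stubs; db, download, and clean are placeholders):

# (a) Vague name and description: the agent cannot tell when to call it
def do_stuff(input: str) -> str:
    """Does stuff with the input."""
    ...

# (b) Raises on error instead of returning a structured dict
def get_invoice(invoice_id: str) -> dict:
    return db.get("invoices", invoice_id)  # KeyError if not found

# (c) Three operations in one call: a partial failure is unrecoverable
def fetch_process_save(url: str) -> None:
    data = download(url)
    cleaned = clean(data)
    db.insert("records", cleaned)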
Pattern Selection — Choose the Right Architecture
Objective: Practice the decision matrix by correctly categorising 10 real tasks.
Parallel Fan-Out — Measure Real Speedup
Objective: Quantify the latency benefit of parallelisation on a real multi-perspective task.
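A minimal timing harness to start from (a sketch: call_llm_async and multi_perspective_review come from the parallel-workflows section above):

import asyncio, time

async def measure_speedup(code: str) -> None:
    # Sequential baseline: the same three reviews, one after another
    t0 = time.perf_counter()
    for focus in ("security vulnerabilities", "performance issues",
                  "readability and maintainability"):
        await call_llm_async(f"Review this code for {focus} only.", code)
    sequential = time.perf_counter() - t0

    # Parallel fan-out: three reviews via asyncio.gather, plus synthesis,
    # so the measured speedup is conservative
    t0 = time.perf_counter()
    await multi_perspective_review(code)
    parallel = time.perf_counter() - t0

    print(f"sequential: {sequential:.1f}s | parallel: {parallel:.1f}s | "
          f"speedup: {sequential / parallel:.1f}x")

# asyncio.run(measure_speedup(open("sample.py").read()))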
P6-M20 MASTERY CHECKLIST
- Can name the 3 tool design principles: idempotency, explicit error contracts, atomicity
- All tools return a dict — never raise exceptions that the agent cannot handle
- Tool names are verb+noun specific: search_knowledge_base not search
- Tool descriptions include "USE when" and "DON'T USE when" sections
- Can implement scope limiting: SQL tools allow only SELECT; rate limiting per tool
- Can name all 5 workflow patterns: prompt chaining, routing, parallelisation, orchestrator-subagent, evaluator-optimizer
- Can apply the "do I need an agent?" decision matrix to a new task
- Know that chains are preferred over agents when steps are predictable
- Can implement the evaluator-optimizer loop: generate → evaluate → improve → repeat
- Can implement routing with a Pydantic classifier and specialised handlers
- Can implement parallel fan-out with asyncio.gather for independent LLM calls
- Know to use return_exceptions=True for fault-tolerant parallel calls
- Can implement voting (N parallel calls, majority answer) for classification tasks
- Can implement orchestrator-subagent: planning → dispatch → synthesis
- Completed Lab 1: tool design audit with before/after failure rate comparison
- Completed Lab 2: pattern selection exercise with implementation and measurement
- Completed Lab 3: parallel fan-out speedup measurement with partial failure test
- Milestone project pushed to GitHub: 3-architecture comparison with findings
✅ When complete: Move to P6-M21 — Failure Handling in Agents. You now know how to design good agents. M21 covers what to do when they go wrong — which they will, at scale.