Part 4 — LLM API Mastery  ·  Module 12 of 14
Structured Outputs & Tool Calling
Get typed Python objects back from LLMs — and make them call your functions
⏱ 1 Week 🟡 Intermediate 🔧 Pydantic · Instructor · OpenAI · Anthropic 📋 Prerequisite: P4-M11
🎯

What This Module Covers

Core AI Engineering

In real applications you almost never want raw text from an LLM — you want structured data you can parse, store, validate, and use in your code. This module covers two critical techniques for getting reliable structure out of LLMs:

  • Structured outputs — forcing the model to return data that matches a Pydantic schema you define. Never parse free-text JSON again.
  • Tool calling (function calling) — giving the model the ability to call your Python functions. This is what transforms an LLM from a text generator into a system that can take real actions.

These two techniques are the foundation of agents, RAG pipelines, and any AI system that needs to interact with the real world. Master them here before building anything complex.

🔗

Why Structured Outputs Matter

Motivation
# The problem with raw text output
response = call_claude("Extract the name, age, and city from: 'John is 28, lives in Mumbai'")
# Response might be:
#   "The name is John, he is 28 years old, and he lives in Mumbai."
#   "Name: John
Age: 28
City: Mumbai"
#   {"name": "John", "age": "28", "city": "Mumbai"}  ← age is a string, not int!
#   {"name": "John", "age": 28}  ← city missing!
# You cannot reliably parse any of these

# With structured outputs (Pydantic + Instructor)
class Person(BaseModel):
    name: str
    age:  int
    city: str

person = extract(text, Person)
print(person.age + 1)   # 29 — it's always an int. Always present.

💡 Structured outputs solve three problems at once: type safety (age is always an int), completeness (required fields are always present), and consistency (same schema every time, regardless of how the model phrases its response).

📐

OpenAI Native Structured Outputs

OpenAI Only

OpenAI (gpt-4o and later) supports native structured outputs via response_format with a JSON schema. The model is guaranteed to return valid JSON matching your schema — it cannot deviate.

from openai import OpenAI
from pydantic import BaseModel
from typing import List, Optional

client = OpenAI()

class CalendarEvent(BaseModel):
    name:       str
    date:       str         # ISO format: YYYY-MM-DD
    participants: List[str]
    location:   Optional[str] = None

# Method 1: parse() helper — simplest approach
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{
        "role": "user",
        "content": "Extract event: 'Meeting with Alice and Bob on 2024-03-15 at Bangalore office'"
    }],
    response_format=CalendarEvent,
)

event = completion.choices[0].message.parsed
print(event.name)           # "Meeting"
print(event.participants)   # ["Alice", "Bob"]
print(event.date)           # "2024-03-15"
print(type(event))          # <class 'CalendarEvent'> — a real Python object

# Handle refusal (model refuses to comply with the request)
if completion.choices[0].message.refusal:
    print(f"Model refused: {completion.choices[0].message.refusal}")
🔧

JSON Mode vs Structured Outputs

Know the Difference
| Feature        | JSON Mode                                | Structured Outputs                         |
|----------------|------------------------------------------|--------------------------------------------|
| Guarantees     | Valid JSON only — no schema enforcement  | Valid JSON matching your exact schema      |
| Missing fields | Can still omit required fields           | Required fields always present             |
| Wrong types    | age can be "28" (a string)               | age is always an int                       |
| Extra fields   | Can add unexpected fields                | Only schema fields returned                |
| Use when       | Quick prototyping, flexible schema       | Production — any time you parse the output |
# JSON mode — just ensures valid JSON, not schema compliance
import json

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},   # JSON mode
    messages=[{"role": "user", "content": "Extract name and age as JSON"}]
)
data = json.loads(response.choices[0].message.content)
# data["age"] might be "28" or 28 — you don't know
🧩

Complex Pydantic Schemas

Real-World Patterns
from pydantic import BaseModel, Field
from typing import List, Optional, Literal
from enum import Enum

# Nested models
class Address(BaseModel):
    street: str
    city:   str
    country: str
    postal_code: Optional[str] = None

class Contact(BaseModel):
    name:    str
    email:   str
    phone:   Optional[str] = None
    address: Optional[Address] = None   # nested model

# Enums for controlled vocabularies
class Priority(str, Enum):
    LOW    = "low"
    MEDIUM = "medium"
    HIGH   = "high"
    URGENT = "urgent"

class Ticket(BaseModel):
    title:    str
    priority: Priority               # must be one of 4 values
    tags:     List[str] = []
    assignee: Optional[Contact] = None

# Discriminated unions — different schema per type
class TextContent(BaseModel):
    type: Literal["text"]
    text: str

class ImageContent(BaseModel):
    type: Literal["image"]
    url:  str
    alt:  Optional[str] = None

from typing import Union, Annotated
Content = Annotated[Union[TextContent, ImageContent], Field(discriminator="type")]

class Post(BaseModel):
    title:    str
    contents: List[Content]   # can be text or image blocks
📦

Instructor — Structured Outputs for Every Provider

Production Standard

Instructor is the cleanest way to get structured outputs from any LLM provider using Pydantic models. It works with OpenAI, Anthropic, Google, HuggingFace, and 15+ others using the same code interface — and adds automatic retries when validation fails.

pip install instructor anthropic openai

import instructor
import anthropic
from openai import OpenAI
from pydantic import BaseModel
from typing import List

# ── With Anthropic (Claude) ────────────────────────────
claude_client = instructor.from_anthropic(anthropic.Anthropic())

class MovieReview(BaseModel):
    title:       str
    rating:      float   # 1.0 to 10.0
    pros:        List[str]
    cons:        List[str]
    recommended: bool

review = claude_client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Review the movie Interstellar"
    }],
    response_model=MovieReview,   # ← Pydantic model as schema
)

print(review.title)       # "Interstellar"
print(review.rating)      # 9.2  — always a float
print(review.recommended) # True — always a bool

# ── With OpenAI (GPT-4o) ───────────────────────────────
oai_client = instructor.from_openai(OpenAI())

review = oai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Review Interstellar"}],
    response_model=MovieReview,   # ← exact same code
)
# Same API regardless of provider — easy to switch
🔄

Automatic Retries and Partial Extraction

Reliability
import anthropic
import instructor
from instructor import Mode
from pydantic import BaseModel, field_validator

# Instructor retries automatically when validation fails
client = instructor.from_anthropic(
    anthropic.Anthropic(),
    mode=Mode.ANTHROPIC_JSON,
    max_retries=3   # retry up to 3 times if schema not satisfied
)

class StrictRating(BaseModel):
    score: float
    label: str

    @field_validator("score")
    @classmethod
    def must_be_in_range(cls, v: float) -> float:
        if not (1.0 <= v <= 10.0):
            raise ValueError(f"Score {v} must be between 1.0 and 10.0")
        return round(v, 1)

    @field_validator("label")
    @classmethod
    def must_be_valid_label(cls, v: str) -> str:
        valid = {"excellent", "good", "average", "poor"}
        if v.lower() not in valid:
            raise ValueError(f"Label must be one of {valid}")
        return v.lower()

# If model returns score=11.0, Instructor catches the validation error,
# tells the model what went wrong, and asks it to try again

# Partial extraction — stream partial objects as they are generated
from instructor import Partial
from typing import List

class LargeReport(BaseModel):
    executive_summary: str
    key_findings:      List[str]
    recommendations:   List[str]
    conclusion:        str

# Stream partial object — UI can update progressively
for partial_report in client.messages.create_partial(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4096,
    messages=[{"role": "user", "content": "Generate a quarterly report"}],
    response_model=Partial[LargeReport],
):
    if partial_report.executive_summary:
        print(partial_report.executive_summary, end="\r")   # overwrite in place as it grows

💡 Automatic retries are Instructor's killer feature. When a field validator raises a ValueError, Instructor sends the model a message saying "Your previous response failed validation: [error]. Please fix and try again." The model almost always succeeds on the second attempt. This makes structured extraction production-ready.
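You can trigger the same validation error locally to see exactly what Instructor would feed back to the model. A minimal sketch re-declaring the StrictRating score validator from above (no API call needed):

```python
from pydantic import BaseModel, ValidationError, field_validator

class StrictRating(BaseModel):
    score: float
    label: str

    @field_validator("score")
    @classmethod
    def must_be_in_range(cls, v: float) -> float:
        if not (1.0 <= v <= 10.0):
            raise ValueError(f"Score {v} must be between 1.0 and 10.0")
        return round(v, 1)

# A valid payload passes and is normalised by the validator
ok = StrictRating(score=9.27, label="excellent")
print(ok.score)   # 9.3

# An out-of-range score raises — this error text is what Instructor
# appends to the retry prompt
try:
    StrictRating(score=11.0, label="poor")
except ValidationError as e:
    print(e.errors()[0]["msg"])
```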

🏭

Real-World Extraction Patterns

Production Use Cases
from pydantic import BaseModel
from typing import List, Literal, Optional

# 1. Invoice parser
class LineItem(BaseModel):
    description: str
    quantity:    int
    unit_price:  float
    total:       float

class Invoice(BaseModel):
    invoice_number: str
    vendor:         str
    line_items:     List[LineItem]
    subtotal:       float
    tax_rate:       float
    total:          float
    due_date:       str   # YYYY-MM-DD

# 2. Meeting notes → action items
class ActionItem(BaseModel):
    task:      str
    owner:     str
    due_date:  Optional[str]
    priority:  Literal["high", "medium", "low"]

class MeetingNotes(BaseModel):
    summary:     str
    decisions:   List[str]
    action_items: List[ActionItem]
    next_meeting: Optional[str]

# 3. Job description parser
class JobDescription(BaseModel):
    role:             str
    company:          str
    location:         str
    salary_min:       Optional[int]
    salary_max:       Optional[int]
    required_skills:  List[str]
    preferred_skills: List[str]
    years_experience: Optional[int]
    remote:           bool

# 4. Support ticket classifier
class SupportTicket(BaseModel):
    category:    Literal["billing", "technical", "account", "general"]
    priority:    Literal["p1", "p2", "p3"]
    sentiment:   Literal["frustrated", "neutral", "positive"]
    summary:     str
    needs_human: bool
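Literal fields compile down to JSON Schema enums — that is how the model gets constrained to those exact values. A quick way to inspect the schema the model actually sees (assuming Pydantic v2's model_json_schema()):

```python
from typing import Literal
from pydantic import BaseModel

class SupportTicket(BaseModel):
    category:    Literal["billing", "technical", "account", "general"]
    priority:    Literal["p1", "p2", "p3"]
    sentiment:   Literal["frustrated", "neutral", "positive"]
    summary:     str
    needs_human: bool

schema = SupportTicket.model_json_schema()
print(schema["properties"]["category"]["enum"])
# ['billing', 'technical', 'account', 'general']
print(schema["required"])
# all five fields — none are Optional, so all are required
```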
🔧

Tool Calling — The Mental Model

Critical Concept

Tool calling is what transforms an LLM from a text generator into something that can take actions — search the web, query a database, call your API, run code. Before writing any code, understand what actually happens:

You define tools (JSON schemas) → LLM decides which tool to call → LLM returns a tool_call object → YOUR code executes the actual function → LLM sees the result and generates the final response

⚠️ The model does NOT execute your functions. It only returns a structured object saying "I want to call get_weather with city='Mumbai'". Your code reads that object and actually calls the function. This distinction is critical for security — you control what runs.

# What a tool call response looks like (Anthropic)
{
    "type": "tool_use",
    "id":   "toolu_01A09q90qw90lq917835lq9",
    "name": "get_weather",
    "input": {
        "city": "Mumbai",
        "units": "celsius"
    }
}
# YOUR code then calls: get_weather(city="Mumbai", units="celsius")
📝

Defining Tools — The 5-Step Pattern

Core Pattern
import anthropic
import json

client = anthropic.Anthropic()

# STEP 1: Define your Python functions
def get_weather(city: str, units: str = "celsius") -> dict:
    # In production: call a real weather API
    return {"city": city, "temp": 28, "condition": "sunny", "units": units}

def calculate(expression: str) -> dict:
    try:
        # NOTE: eval with empty __builtins__ blocks most builtins but is NOT a
        # real sandbox — use ast.literal_eval or a math parser in production
        result = eval(expression, {"__builtins__": {}})
        return {"result": result, "expression": expression}
    except Exception as e:
        return {"error": str(e)}

# STEP 2: Describe the tools in JSON Schema
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a specific city. Use this when the user asks about weather, temperature, or conditions in a location.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The city name, e.g. 'Mumbai', 'Delhi', 'Bangalore'"
                },
                "units": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature unit. Default: celsius"
                }
            },
            "required": ["city"]
        }
    },
    {
        "name": "calculate",
        "description": "Evaluate a mathematical expression. Use this for any arithmetic, percentage, or numeric calculation. Do NOT use this for non-math questions.",
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "A valid Python math expression, e.g. '(100 * 1.15) + 50'"
                }
            },
            "required": ["expression"]
        }
    }
]

# STEP 3: Send request with tools
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Mumbai? Also, what is 15% of 2500?"}]
)

# STEP 4: Execute the tool calls
tool_results = []
for block in response.content:
    if block.type == "tool_use":
        if block.name == "get_weather":
            result = get_weather(**block.input)
        elif block.name == "calculate":
            result = calculate(**block.input)
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": json.dumps(result)
        })

# STEP 5: Send results back to get final response
final_response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    messages=[
        {"role": "user",      "content": "What's the weather in Mumbai? Also, 15% of 2500?"},
        {"role": "assistant", "content": response.content},
        {"role": "user",      "content": tool_results}
    ]
)
print(final_response.content[0].text)
🎯

Writing Tool Descriptions That Work

Critical Skill

The tool description is the model's user manual. A vague description leads to wrong tool selection. Be explicit about when to use the tool, not just what it does.

# BAD tool description — vague
{
    "name": "search",
    "description": "Search for information",
    "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}}
}

# GOOD tool description — specific when/what/not
{
    "name": "search_knowledge_base",
    "description": """Search the internal company knowledge base for product documentation,
FAQs, and policy documents. Use this when the user asks about:
- Product features or specifications
- Company policies or procedures
- Troubleshooting steps

Do NOT use this for: general knowledge questions, math calculations,
or anything not related to company products and policies.""",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural language search query, e.g. 'How do I reset my password?'"
            },
            "category": {
                "type": "string",
                "enum": ["products", "policies", "support"],
                "description": "Filter results by category. Optional."
            }
        },
        "required": ["query"]
    }
}
  • Name — self-explanatory verb: search_knowledge_base not search
  • Description — explain WHEN to call (not just what), give examples, and state when NOT to use it
  • Parameters — include examples in descriptions: "e.g. 'Mumbai', 'Delhi'"
  • Required vs optional — mark truly optional params as optional with sensible defaults
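These rules can be partially automated. A hypothetical lint helper (not part of any SDK) that flags weak Anthropic-style tool definitions against the checklist above:

```python
def lint_tool(tool: dict) -> list:
    """Flag common weaknesses in an Anthropic-style tool definition.
    Hypothetical helper — adjust the thresholds to taste."""
    warnings = []
    if len(tool.get("description", "")) < 60:
        warnings.append("description is short — say WHEN to use the tool, with examples")
    if "not" not in tool.get("description", "").lower():
        warnings.append("description never says when NOT to use the tool")
    schema = tool.get("input_schema", {})
    for pname, prop in schema.get("properties", {}).items():
        if "description" not in prop:
            warnings.append(f"parameter '{pname}' has no description")
    if not schema.get("required"):
        warnings.append("no required parameters declared")
    return warnings

# The BAD example from above trips every check
bad_tool = {
    "name": "search",
    "description": "Search for information",
    "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
}
for w in lint_tool(bad_tool):
    print("-", w)
```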
⚙️

OpenAI Tool Calling

Syntax Differences
import json
from openai import OpenAI

client = OpenAI()

# OpenAI uses slightly different field names
tools = [{
    "type": "function",                # required wrapper
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {              # "parameters" not "input_schema"
            "type": "object",
            "properties": {
                "city": {"type": "string"}
            },
            "required": ["city"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4o",
    tools=tools,
    tool_choice="auto",     # "auto" | "required" | "none" | specific tool
    messages=[{"role": "user", "content": "Weather in Mumbai?"}]
)

# Parse tool calls
message = response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        name = tool_call.function.name
        args = json.loads(tool_call.function.arguments)
        # Execute function based on name...
🔁

The Complete Tool Loop — Production Pattern

Production
import anthropic, json
from typing import Any

client = anthropic.Anthropic()

# Tool registry — maps tool name → Python function
# (get_weather and calculate are defined earlier; search_notes is assumed to exist)
TOOL_REGISTRY = {
    "get_weather":  get_weather,
    "calculate":    calculate,
    "search_notes": search_notes,
}

def run_tool_loop(user_message: str, tools: list, max_turns: int = 10) -> str:
    """Run a complete tool loop until the model produces a final text response."""
    messages = [{"role": "user", "content": user_message}]

    for turn in range(max_turns):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            tools=tools,
            messages=messages
        )

        # Check stop reason
        if response.stop_reason == "end_turn":
            # Model finished — return text response
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return ""

        if response.stop_reason != "tool_use":
            break   # unexpected stop reason

        # Append assistant message
        messages.append({"role": "assistant", "content": response.content})

        # Execute all tool calls
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue

            func = TOOL_REGISTRY.get(block.name)
            if func is None:
                result = {"error": f"Unknown tool: {block.name}"}
            else:
                try:
                    result = func(**block.input)
                except Exception as e:
                    result = {"error": str(e), "tool": block.name}

            tool_results.append({
                "type":        "tool_result",
                "tool_use_id": block.id,
                "content":     json.dumps(result)
            })

        messages.append({"role": "user", "content": tool_results})

    return "Max turns reached without final response"

# Usage
answer = run_tool_loop(
    "What's the weather in Mumbai and Delhi? Which city is warmer?",
    tools=tools
)
print(answer)
⚙️

tool_choice — Controlling Which Tool Gets Called

Control
# Anthropic tool_choice options

# "auto" (default) — model decides whether to use a tool or respond directly
tool_choice={"type": "auto"}

# "any" — model MUST call a tool (useful to force structured extraction)
tool_choice={"type": "any"}

# Specific tool — model MUST call this exact tool
tool_choice={"type": "tool", "name": "extract_invoice"}

# When to use each:
# "auto"     — conversational agents where tool use is optional
# "any"      — when you always need structured output (extraction pipelines)
# specific   — when you know exactly which tool to force (single-purpose endpoints)

# OpenAI equivalents
tool_choice = "auto"       # let model decide
tool_choice = "required"   # must use a tool (= Anthropic "any")
tool_choice = "none"       # never use tools
tool_choice = {"type": "function", "function": {"name": "get_weather"}}  # force specific

Parallel Tool Calls

Performance

Modern models can call multiple tools in a single turn. This is dramatically faster than sequential calls — instead of 3 round trips to the API, you do 1.

# The model may return multiple tool_use blocks in one response
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user",
               "content": "Get weather for Mumbai, Delhi, and Bangalore"}]
)

# response.content may contain 3 tool_use blocks simultaneously
# Execute all of them, then send all results back at once

import asyncio

async def execute_tool_calls_parallel(tool_calls: list) -> list:
    """Execute multiple tool calls concurrently."""
    async def execute_one(block) -> dict:
        func = TOOL_REGISTRY.get(block.name)
        result = await asyncio.to_thread(func, **block.input)
        return {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": json.dumps(result)
        }
    return await asyncio.gather(*[execute_one(b) for b in tool_calls])

💡 Parallel tool calls matter for agents. An agent researching 5 topics simultaneously via search tools is 5× faster than one that searches sequentially. Always process all tool_use blocks in a single response together, not one by one.
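The parallel executor can be exercised without an API key by faking the tool_use blocks. A self-contained sketch where SimpleNamespace stands in for the SDK's block objects:

```python
import asyncio
import json
from types import SimpleNamespace

# Fake registry — real code would reuse TOOL_REGISTRY from the tool loop
TOOL_REGISTRY = {"get_weather": lambda city: {"city": city, "temp": 28}}

async def execute_tool_calls_parallel(tool_calls: list) -> list:
    """Execute multiple tool calls concurrently."""
    async def execute_one(block) -> dict:
        func = TOOL_REGISTRY[block.name]
        result = await asyncio.to_thread(func, **block.input)
        return {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": json.dumps(result),
        }
    return await asyncio.gather(*[execute_one(b) for b in tool_calls])

# Three simulated tool_use blocks, as one model response might contain
blocks = [
    SimpleNamespace(name="get_weather", id=f"toolu_{i}", input={"city": c})
    for i, c in enumerate(["Mumbai", "Delhi", "Bangalore"])
]
results = asyncio.run(execute_tool_calls_parallel(blocks))
print(len(results))               # 3
print(results[0]["tool_use_id"])  # toolu_0
```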

FREE LEARNING RESOURCES

| Type     | Resource                                                                     | Best For                                                                                     |
|----------|------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|
| Docs     | OpenAI Structured Outputs Guide — platform.openai.com                        | Covers the feature that ensures models always generate responses adhering to your JSON Schema. |
| Library  | Instructor library — python.useinstructor.com                                | The cleanest way to get structured outputs from any LLM provider. Production standard.       |
| Docs     | OpenAI Function Calling Guide — platform.openai.com                          | Definitive reference for tool calling with OpenAI models.                                    |
| Docs     | Anthropic Tool Use Docs — docs.anthropic.com                                 | Anthropic's complete guide to tool calling with Claude.                                      |
| Notebook | OpenAI Cookbook: How to Call Functions — github.com/openai/openai-cookbook   | Complete runnable notebook walking through the full tool-calling loop with real examples.    |

MILESTONE PROJECT

🛠 Invoice Parser + 3-Tool Assistant [Intermediate] 3–4 days

Part A — Invoice Parser: Use Instructor to extract structured data from raw invoice text.

  • Define a full Invoice Pydantic model: invoice_number, vendor, line_items (list), subtotal, tax_rate, total, due_date
  • Test on 5 different invoice text formats (different layouts, missing fields, different currencies)
  • Add field validators: total must equal subtotal * (1 + tax_rate), due_date must be valid ISO date
  • Observe Instructor's automatic retry behaviour when validation fails
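The cross-field check in the third bullet can be expressed with a model_validator. A minimal sketch (Pydantic v2; the 0.01 tolerance absorbs float rounding):

```python
from pydantic import BaseModel, ValidationError, model_validator

class Invoice(BaseModel):
    subtotal: float
    tax_rate: float
    total:    float

    @model_validator(mode="after")
    def total_is_consistent(self):
        # total must equal subtotal * (1 + tax_rate), within a small tolerance
        expected = self.subtotal * (1 + self.tax_rate)
        if abs(self.total - expected) > 0.01:
            raise ValueError(
                f"total {self.total} != subtotal * (1 + tax_rate) = {expected:.2f}"
            )
        return self

# A consistent invoice passes
print(Invoice(subtotal=1000.0, tax_rate=0.18, total=1180.0).total)   # 1180.0

# An inconsistent one raises — Instructor would feed this error back on retry
try:
    Invoice(subtotal=1000.0, tax_rate=0.18, total=1500.0)
except ValidationError as e:
    print("rejected:", e.errors()[0]["msg"])
```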

Part B — 3-Tool Assistant: Build a conversational assistant with three callable tools.

  • get_weather(city) — calls Open-Meteo API (no key needed)
  • calculate(expression) — evaluates math expressions safely
  • search_notes(query) — searches a hardcoded dict of notes by keyword
  • Implement the full 5-step tool loop with parallel execution
  • Test with: "What's the weather in Mumbai?", "What is 15% of 8500?", "Find notes about Python", "What's the weather in Delhi and Mumbai, and which is warmer?" (parallel)

Skills: Pydantic, Instructor, field validators, Anthropic/OpenAI SDK, tool calling loop, parallel tool execution

LAB 1

Structured Extraction — Compare JSON Mode vs Instructor

Objective: Directly observe what structured outputs guarantee vs what JSON mode does not.

1
Build a Contact extractor: name (str), email (str), phone (Optional[str]), company (Optional[str]). Prepare 10 test inputs to reuse across both versions: some with all fields, some with missing fields, one with a malformed email, one with phone numbers in different formats.
2
Version A: JSON mode only — parse the response text with json.loads(). Run all 10 inputs. Count: how many parsed successfully? How many had wrong types? How many were missing required fields?
3
Version B: Instructor with Pydantic model. Run the same 10 inputs. Count the same metrics. Compare.
4
Add a validator that normalises phone numbers to E.164 format (+91XXXXXXXXXX). Watch Instructor retry when the model returns "9876543210" (not E.164). Count how many retries occurred across all 10 inputs.
5
Document: What failure modes did JSON mode have that Instructor caught? When is JSON mode "good enough"?
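A sketch of the normalising validator for step 4 (assumes Indian numbers; production code would use a library such as phonenumbers):

```python
import re
from typing import Optional
from pydantic import BaseModel, field_validator

class Contact(BaseModel):
    name:  str
    email: str
    phone: Optional[str] = None

    @field_validator("phone")
    @classmethod
    def to_e164(cls, v: Optional[str]) -> Optional[str]:
        if v is None:
            return v
        digits = re.sub(r"\D", "", v)   # strip spaces, dashes, parens, '+'
        if len(digits) == 10:           # bare 10-digit Indian mobile number
            return "+91" + digits
        if len(digits) == 12 and digits.startswith("91"):
            return "+" + digits
        # Raising here is what triggers Instructor's retry with the error message
        raise ValueError(f"cannot normalise {v!r} to E.164")

print(Contact(name="Asha", email="asha@example.com", phone="98765 43210").phone)
# +919876543210
```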
LAB 2

Tool Description Quality — See How It Affects Selection

Objective: Empirically measure how tool description quality affects which tool the model selects.

1
Create 3 tools: get_weather, search_docs, calculate. Write Version A with minimal descriptions (just the tool name and one line).
2
Test 10 ambiguous messages that could fit multiple tools: "How much is 28 degrees in Fahrenheit?", "Find information about temperature limits in the docs", "What is the current temperature in Mumbai?" Record which tool was selected each time.
3
Write Version B with full descriptions including "Use when:", "Do NOT use when:", examples in parameter descriptions. Run the same 10 messages.
4
Compare selections. How many changed? Which changes were improvements? Document the 3 most impactful improvements you made to descriptions.
LAB 3

Build and Test the Complete Tool Loop

Objective: Build the complete production tool loop and test every edge case.

1
Implement the run_tool_loop() function from Tab 4 with the 3 tools (weather, calculate, search_notes).
2
Test happy path: "What's 20% tip on a ₹2400 bill?" — should call calculate and return a clear answer.
3
Test no-tool path: "What is the capital of France?" — model should answer directly without calling any tool. Verify stop_reason == "end_turn" on the first turn.
4
Test parallel calls: "What is the weather in Mumbai, Delhi, and Bangalore?" — should trigger 3 simultaneous tool_use blocks in one response. Verify all 3 are executed before the next API call.
5
Test error handling: make get_weather() raise an exception for "InvalidCity". Does the model gracefully handle the error in the tool_result? What does it tell the user?
6
Test max_turns: give the model a tool that always returns "try again" and verify the loop terminates at max_turns rather than running forever.

P4-M12 MASTERY CHECKLIST

When complete: Move to P4-M13 — Streaming & Conversation State. The tool calling patterns you built here are the foundation of agents in Part 6 — agents are just tool loops with more sophisticated decision logic.

← P4-M11: Prompting 🗺️ All Modules Next: P4-M13 — Streaming →