What This Module Covers
Core of AI EngineeringPrompting is not just asking questions nicely. It is the craft of writing instructions that produce consistent, reliable outputs from models that are fundamentally probabilistic. As an AI engineer, you will spend a surprising amount of time here — a prompt that works 80% of the time is not good enough for production.
- Message structure — system vs user vs assistant roles, what each controls
- Zero-shot, one-shot, few-shot — when and how to use examples
- Chain-of-thought (CoT) — making models reason step-by-step before answering
- Role prompting — assigning personas for consistent tone and behaviour
- XML structuring — using tags to separate instructions from content
- Output formatting — controlling response format without structured outputs
- Prompt debugging — how to systematically improve a prompt that is not working
💡 Prompting is the foundation of every module from here. Good prompting makes tool calling more reliable, RAG answers more grounded, agents more predictable, and structured outputs easier to validate. This week sets the quality floor for everything you build.
First API Call — Getting Started
Setuppip install anthropic openai python-dotenv # .env file # ANTHROPIC_API_KEY=sk-ant-... # OPENAI_API_KEY=sk-proj-... import anthropic import os from dotenv import load_dotenv load_dotenv() client = anthropic.Anthropic() # Your first API call response = client.messages.create( model="claude-3-5-sonnet-20241022", max_tokens=1024, messages=[ {"role": "user", "content": "What is the capital of France?"} ] ) print(response.content[0].text) # "Paris" print(response.usage.input_tokens, response.usage.output_tokens)
# OpenAI equivalent from openai import OpenAI client = OpenAI() response = client.chat.completions.create( model="gpt-4o", messages=[ {"role": "user", "content": "What is the capital of France?"} ] ) print(response.choices[0].message.content)
The Messages Array — System, User, Assistant
CriticalEvery LLM API call uses a messages array with specific roles. Understanding what each role controls is the foundation of all prompt engineering.
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
system="""You are a senior Python engineer at a fintech company.
You write clean, well-documented code with type hints.
You always consider edge cases and error handling.
When you are unsure about a requirement, ask a clarifying question.""",
messages=[
{"role": "user", "content": "Write a function to parse currency strings like '$1,234.56'"},
{"role": "assistant", "content": "Here is the implementation:
def parse_currency..."},
{"role": "user", "content": "Also handle Euro format: 1.234,56 €"},
]
)| Role | What It Controls | When to Use |
|---|---|---|
system | Persistent instructions, persona, constraints, output format — set once, applies to the entire conversation | Defining who the model is and how it behaves. Never put user-controllable data here. |
user | The human turn — questions, requests, context, documents to process | Every human message. Can include retrieved documents, examples, data. |
assistant | The model's previous responses — used to maintain conversation history | Multi-turn conversations. Also used to "prefill" — start the model's response to steer output format. |
Key Parameters — Temperature, max_tokens, top_p
Parametersresponse = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=2048, # Maximum tokens in response. Set high enough for your use case.
temperature=0.0, # 0.0 = deterministic, 1.0 = creative, >1.0 = chaotic
# top_p=0.9, # nucleus sampling — alternative to temperature
messages=[...]
)| Parameter | Effect | Recommended Value |
|---|---|---|
temperature=0.0 | Fully deterministic — same prompt always gives same output | Data extraction, classification, code generation |
temperature=0.3 | Mostly consistent with slight variation | Q&A, summarisation, analysis |
temperature=0.7 | Balanced creativity vs consistency | Writing assistance, brainstorming |
temperature=1.0+ | High creativity, unpredictable | Creative fiction, poetry, divergent ideas |
⚠️ For AI engineering tasks (data extraction, classification, tool calling), always use temperature=0.0. Any non-zero temperature means the model might give different answers to the same question — which breaks deterministic pipelines.
Prefilling — Controlling Output Format
AdvancedAnthropic (Claude) supports "prefilling" — you start the assistant's response to force a specific format. This is extremely powerful for structured outputs without the full Pydantic machinery.
# Force JSON output by prefilling with opening brace response = client.messages.create( model="claude-3-5-sonnet-20241022", max_tokens=1024, messages=[ {"role": "user", "content": "Extract: name, age, city from: 'John is 28 years old and lives in Mumbai'"}, {"role": "assistant", "content": "{"}, # ← prefill forces JSON start ] ) # Model MUST continue the JSON: {"name": "John", "age": 28, "city": "Mumbai"} result = "{" + response.content[0].text # prepend the "{" we used as prefill # Force numbered list format messages=[ {"role": "user", "content": "List 5 benefits of RAG"}, {"role": "assistant", "content": "1."}, # ← forces numbered list ]
The Six Core Techniques
Master TheseZero-Shot
Give the task with no examples. Works well for simple, well-defined tasks. Fastest and cheapest.
Few-Shot
Provide 2–5 input/output examples before the real task. Most reliable technique for consistent formatting.
Chain-of-Thought
Ask the model to reason step-by-step before answering. Dramatically improves accuracy on complex tasks.
Role Prompting
Assign a specific persona in the system prompt. Anchors tone, vocabulary, and domain expertise.
XML Tags
Use <tags> to clearly separate instructions, context, examples, and the actual task. Prevents confusion.
Output Constraints
Explicitly state the format, length, tone, and structure you want in the output.
Zero-Shot vs Few-Shot — When to Use Each
Most Used# Few-shot implementation EXAMPLES = [ ("Amazing product, works perfectly!", "POSITIVE"), ("Arrived broken, waste of money.", "NEGATIVE"), ("It does what it says.", "NEUTRAL"), ] def classify_review(review: str) -> str: example_text = "\n".join( f'Review: "{inp}" → {out}'' for inp, out in EXAMPLES ) prompt = f"""Classify each review as POSITIVE, NEGATIVE, or NEUTRAL. Reply with only the label. {example_text} Review: "{review}" →""" response = client.messages.create( model="claude-3-5-sonnet-20241022", max_tokens=10, # only need one word temperature=0.0, messages=[{"role": "user", "content": prompt}] ) return response.content[0].text.strip()
Chain-of-Thought (CoT) — Reasoning Before Answering
Accuracy BoosterCoT dramatically improves accuracy on tasks requiring reasoning, multi-step logic, or math. The model "thinks out loud" and catches its own errors before committing to an answer.
# Zero-Shot CoT — just add "Think step by step" prompt = f""" {question} Think step by step before giving your final answer. """ # CoT with scratchpad — separate reasoning from answer prompt = f""" {question} First, reason through this carefully in a <scratchpad> tag. Then give your final answer in an <answer> tag. """ # Parse out just the answer (not the reasoning) import re response_text = response.content[0].text answer_match = re.search(r'<answer>(.*?)</answer>', response_text, re.DOTALL) if answer_match: answer = answer_match.group(1).strip() # CoT for classification — "Explain your reasoning, then classify" system = """Analyze the given text. First explain your reasoning in 1-2 sentences. Then output exactly one of: POSITIVE / NEGATIVE / NEUTRAL on a new line."""
💡 CoT works because it changes what tokens the model predicts. Without CoT, the model predicts the final answer token directly. With CoT, it predicts reasoning tokens first, which condition it to predict a better final answer. The reasoning is not just cosmetic — it actually changes the computation.
XML Tags — Separating Instructions from Content
Anthropic RecommendedXML tags prevent the model from confusing your instructions with the content it is processing. This is especially important when the user-provided content might contain instruction-like language.
# XML tag pattern — use for any user-provided content def summarise(document: str, max_sentences: int = 3) -> str: prompt = f"""Summarise the document below in {max_sentences} sentences. Focus on the key points. Do not include opinions not present in the text. <document> {document} </document> Summary:""" return call_claude(prompt) # Multi-section prompt with XML prompt = f"""You are a code reviewer. Review the code below. <requirements> {requirements} </requirements> <code> {code} </code> Identify: bugs, missing error handling, style issues. Format your response as a numbered list."""
Role Prompting — Consistent Persona and Expertise
System Prompt# Role prompt anchors tone, vocabulary, and domain expertise # Customer support agent SUPPORT_SYSTEM = """You are Alex, a friendly customer support agent at TechCorp. You have deep knowledge of TechCorp products and policies. Guidelines: - Always acknowledge the customer's frustration before troubleshooting - Offer concrete next steps, not vague reassurances - If you cannot resolve an issue, escalate clearly: "I'll escalate this to our specialist team." - Never promise things you cannot guarantee - Keep responses concise: 2-3 paragraphs maximum""" # Technical documentation writer DOCS_SYSTEM = """You are a technical writer at a developer tools company. You write clear, precise documentation for software engineers. Style: active voice, present tense, second person ("you"). Format: use code examples for every concept. Include "When to use" and "When NOT to use" sections. Audience: senior engineers who prefer depth over simplification.""" # Data analysis assistant DATA_SYSTEM = """You are a senior data analyst. When given data or questions about data: 1. Start with the most important insight, not methodology 2. Quantify everything — use specific numbers, not vague terms like "many" or "few" 3. Flag data quality issues proactively 4. Distinguish between correlation and causation explicitly 5. Always suggest the next most valuable analysis"""
💡 The best role prompts specify behaviour, not just identity. "You are a Python expert" is weak. "You are a Python expert who always writes type hints, documents edge cases, and asks clarifying questions before implementing" is strong — it specifies what the model does, not just what it is.
Prompt Engineering for Reliability
Production QualityA prompt that works in testing but fails in production is not a good prompt. These patterns improve reliability across varied inputs.
# 1. Explicit output format — remove ambiguity BAD = "Extract the key information from this contract." GOOD = """Extract from this contract: - Party A (company name and jurisdiction) - Party B (company name and jurisdiction) - Contract value (number and currency) - Start date (ISO format: YYYY-MM-DD) - End date (ISO format: YYYY-MM-DD) If any field is not present, output: null Output as JSON only. No prose.""" # 2. Negative instructions — tell the model what NOT to do system = """You are a medical information assistant. DO NOT provide specific diagnoses. DO NOT recommend specific medications or dosages. DO NOT suggest the user stop or change current medications. Always recommend consulting a qualified healthcare provider.""" # 3. Fallback handling — what to do when unsure prompt = """Answer the user's question based only on the provided context. If the answer is not in the context, respond exactly with: "I don't have enough information to answer this question." Do not make up information. <context> {context} </context> Question: {question}""" # 4. Confidence calibration prompt = """Answer the question. After your answer, rate your confidence: HIGH: you are certain this is correct MEDIUM: you are fairly confident but acknowledge uncertainty LOW: you are guessing and the user should verify Format: [answer] Confidence: HIGH/MEDIUM/LOW Reason for confidence level: [one sentence]"""
Self-Consistency and Verification
Accuracy Patterns# Self-consistency — run same prompt N times, take majority vote from collections import Counter def classify_with_consistency(text: str, n: int = 5) -> str: results = [] for _ in range(n): response = client.messages.create( model="claude-3-5-sonnet-20241022", max_tokens=10, temperature=0.3, # slight variation per run messages=[{"role": "user", "content": f'Classify: {text}'}] ) results.append(response.content[0].text.strip()) most_common = Counter(results).most_common(1)[0] return most_common[0] # most frequent answer # Verify-and-correct — ask model to check its own work async def verified_extraction(text: str) -> dict: # Step 1: extract extraction = await extract(text) # Step 2: verify verify_prompt = f"""Check if this extraction is accurate and complete. Original text: {text} Extracted data: {extraction} Is anything missing, incorrect, or hallucinated? If correct, respond: VERIFIED If issues found, respond: ISSUES: [describe what's wrong]""" verification = await call_claude(verify_prompt) if "ISSUES:" in verification: # Step 3: re-extract with the issues identified return await extract_with_context(text, verification) return extraction
Prompt Debugging — Systematic Improvement
Debugging ProcessWhen a prompt is not working, do not randomly tweak it. Follow this systematic process:
# STEP 1: Identify failure mode # - Wrong format? → add explicit format instructions # - Hallucinating? → add grounding instructions + "only use provided context" # - Too verbose? → add length constraints # - Wrong tone? → strengthen role prompt # - Inconsistent? → add few-shot examples of correct output # - Missing cases? → add explicit instructions for edge cases # STEP 2: Build a test set test_cases = [ {"input": "easy case", "expected": "X"}, {"input": "edge case", "expected": "Y"}, {"input": "adversarial", "expected": "Z"}, ] # STEP 3: Measure baseline accuracy def evaluate_prompt(prompt_template: str, test_cases: list) -> float: correct = 0 for case in test_cases: result = call_claude(prompt_template.format(**case)) if result.strip() == case["expected"]: correct += 1 return correct / len(test_cases) # STEP 4: Make ONE change at a time and re-measure # Never change multiple things simultaneously — you won't know what helped # STEP 5: Document what worked and why # Prompts are code — version control them like code
💡 The most common prompting mistake is changing multiple things at once when debugging. If you add examples AND change the format instructions AND modify the role prompt, and things improve, you do not know which change helped. Change one thing, measure, then decide.
The 8 Most Common Prompting Mistakes
Avoid These1. Vague instructions
2. Burying the instruction at the end
3. Asking two things at once without format separation
4. Not testing on edge cases
- Test with empty input, very long input, non-English input, adversarial input ("ignore previous instructions")
- Test with inputs that have the right structure but wrong content
- Test the exact failure cases your users will hit — not just the happy path
5. Using temperature > 0 for extraction tasks
- Always use temperature=0.0 for classification, extraction, code generation, and anything requiring deterministic output
- Only use temperature > 0 for creative tasks where variation is desirable
6. Not grounding the model on factual tasks
7. Prompt injection — not sanitising user input
- Always wrap user-provided content in XML tags so the model distinguishes it from instructions
- Never concatenate user input directly into system prompt instructions
8. Skipping few-shot examples for format-sensitive tasks
- If your application requires a specific JSON shape, CSV format, or custom structure — provide 2–3 exact examples
- Instructions alone are rarely enough for precise formatting — show, don't just tell
FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Tutorial | Anthropic Interactive Prompt Engineering Tutorial — github.com/anthropics/prompt-eng-interactive-tutorial | Best hands-on prompting course. 9 chapters with exercises. Run as Jupyter notebooks with the Claude API. |
| Docs | Anthropic Prompt Engineering Docs — docs.anthropic.com | Official reference covering XML structuring, agentic systems, and advanced patterns. |
| Docs | OpenAI Prompt Engineering Guide — platform.openai.com | OpenAI's official guide. Covers formats that work well with GPT models. |
| Guide | PromptingGuide.ai — promptingguide.ai | Comprehensive guide from basic to advanced strategies. Good CoT and agent sections. |
MILESTONE PROJECT
The best way to internalise prompting techniques is to compare them directly on the same task. This project forces you to observe exactly how prompt design affects output quality.
Task: Sentiment analysis on product reviews
Given 20 product reviews (mix of positive, negative, neutral, and ambiguous), write 5 different prompts and compare their output quality, consistency, and handling of edge cases.
Prompts to write and compare
- Prompt 1 — Zero-shot bare: Just ask for sentiment with no guidance
- Prompt 2 — Zero-shot with format: Specify exact output format (one word: POSITIVE/NEGATIVE/NEUTRAL)
- Prompt 3 — Few-shot: 3 labelled examples before the real input
- Prompt 4 — CoT: Ask model to reason before classifying
- Prompt 5 — Role + few-shot + CoT combined: Best possible prompt
Measurement
- Run each prompt on all 20 reviews (temperature=0.0 for fair comparison)
- Manually label all 20 reviews yourself — this is your ground truth
- Calculate accuracy for each prompt. Compare consistency across runs.
- Write a 1-paragraph conclusion: what techniques made the biggest difference and why
Skills: Anthropic/OpenAI SDK, prompt design, systematic evaluation, few-shot construction
Build a Python function summarise(document, style="executive") where style can be "executive" (3 bullet points), "technical" (key decisions + technical details), or "casual" (plain English, conversational). Each style uses a different system prompt and demonstrates how role + output format instructions change output character completely. Test on 3 real articles.
System Prompt Isolation — See Exactly What It Controls
Objective: Build intuition for what system prompts do by running identical user messages with radically different system prompts.
Few-Shot Example Quality — Good vs Bad Examples
Objective: Discover how the quality and choice of few-shot examples impacts output reliability.
Chain-of-Thought — Measure Accuracy Improvement
Objective: Empirically demonstrate that CoT improves accuracy on reasoning tasks — so the benefit is concrete, not theoretical.
P4-M11 MASTERY CHECKLIST
- Can make a successful API call to both Anthropic (Claude) and OpenAI with proper authentication
- Know the difference between system, user, and assistant roles — and what each controls
- Know when to use temperature=0.0 vs higher values — and why it matters for AI engineering tasks
- Can write a zero-shot prompt that consistently produces a specific output format
- Can construct few-shot examples that are internally consistent and representative
- Can apply chain-of-thought prompting and measure whether it improved accuracy
- Use XML tags to separate instructions from user-provided content in all production prompts
- Can write a role prompt that specifies behaviour, not just identity
- Know the prefilling technique for Anthropic models and when to use it
- Can systematically debug a failing prompt: identify the failure mode, build test cases, change one thing at a time
- Know the 8 common prompting mistakes and can identify them in existing prompts
- Always include fallback handling ("If you cannot find the answer, say X") for factual tasks
- Completed Lab 1: system prompt isolation experiment
- Completed Lab 2: few-shot example quality comparison
- Completed Lab 3: CoT accuracy measurement
- Milestone project — 5-prompt comparison pushed to GitHub with findings documented
✅ When complete: Move to P4-M12 — Structured Outputs & Tool Calling. The prompting discipline you built here — XML tags, explicit format instructions, few-shot examples — is exactly what makes structured outputs and tool descriptions reliable.