What This Module Covers
AI-Specific Ops

The operational challenges unique to AI systems: prompts silently change behaviour when edited, LLM costs accumulate invisibly, and identical queries hit the API repeatedly. This module gives you systems for each problem.
- Prompt versioning — storing prompts in DB/Git, tracking changes, rollback on regression
- Prompt testing — regression testing before deploying a changed prompt
- Cost monitoring — per-user, per-endpoint, per-model spend dashboards
- Response caching — semantic deduplication, Redis TTL cache for identical queries
- Anthropic prompt caching — 90% cost reduction on large repeated system prompts
Prompt Versioning — Never Lose a Working Prompt
Version Control

A prompt is code. Like code, it should be versioned, reviewed, and tested before deployment. A casual edit to a production system prompt can break behaviour for every user — silently.
import sqlite3, hashlib
from datetime import datetime
from typing import Optional

# ── DB-backed prompt registry ─────────────────────────
def init_prompt_db():
    with sqlite3.connect("prompts.db") as conn:
        conn.execute("""CREATE TABLE IF NOT EXISTS prompts (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            name TEXT NOT NULL,
            version INTEGER NOT NULL,
            content TEXT NOT NULL,
            hash TEXT NOT NULL,
            author TEXT,
            notes TEXT,
            is_active INTEGER DEFAULT 0,
            created_at TEXT NOT NULL,
            UNIQUE(name, version))""")
        conn.execute("CREATE INDEX IF NOT EXISTS idx_name ON prompts(name, is_active)")

def register_prompt(name: str, content: str, author: str = "", notes: str = "") -> int:
    """Register a new version of a prompt. Returns version number."""
    h = hashlib.sha256(content.encode()).hexdigest()[:12]
    now = datetime.utcnow().isoformat()
    with sqlite3.connect("prompts.db") as conn:
        row = conn.execute(
            "SELECT MAX(version) FROM prompts WHERE name=?", (name,)).fetchone()
        version = (row[0] or 0) + 1
        conn.execute("""INSERT INTO prompts
            (name,version,content,hash,author,notes,created_at)
            VALUES (?,?,?,?,?,?,?)""",
            (name, version, content, h, author, notes, now))
    return version

def activate_prompt(name: str, version: int):
    """Activate a specific version — all others for this name become inactive."""
    with sqlite3.connect("prompts.db") as conn:
        conn.execute("UPDATE prompts SET is_active=0 WHERE name=?", (name,))
        conn.execute(
            "UPDATE prompts SET is_active=1 WHERE name=? AND version=?",
            (name, version))

def get_active_prompt(name: str) -> Optional[dict]:
    with sqlite3.connect("prompts.db") as conn:
        row = conn.execute(
            "SELECT content, version, hash FROM prompts WHERE name=? AND is_active=1",
            (name,)).fetchone()
    if not row:
        return None
    return {"content": row[0], "version": row[1], "hash": row[2]}

def rollback_prompt(name: str, to_version: int):
    """Rollback to a previous version."""
    activate_prompt(name, to_version)
    print(f"Rolled back {name!r} to version {to_version}")

def list_prompt_history(name: str) -> list[dict]:
    with sqlite3.connect("prompts.db") as conn:
        rows = conn.execute("""SELECT version, hash, author, is_active, created_at, notes
            FROM prompts WHERE name=? ORDER BY version DESC""", (name,)).fetchall()
    return [{"version": r[0], "hash": r[1], "author": r[2],
             "active": bool(r[3]), "created": r[4], "notes": r[5]} for r in rows]

# Usage workflow:
# v1 = register_prompt("rag_system", "You are a helpful assistant...")   → version 1
# activate_prompt("rag_system", 1)                                       → live
# v2 = register_prompt("rag_system", "You are a precise assistant...")   → version 2
# run_regression_tests("rag_system", v2)    ← test BEFORE activating
# activate_prompt("rag_system", 2)                                       → live
# if metrics worsen: rollback_prompt("rag_system", 1)                    → instant
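The usage workflow references run_regression_tests, which is not defined in this module. A minimal sketch using pytest's programmatic entry point, assuming the regression suite lives in tests/test_prompts.py; the PROMPT_UNDER_TEST environment hook is a hypothetical way for your test fixtures to select the version under test:

import os
import pytest

def run_regression_tests(name: str, version: int) -> bool:
    """Run the prompt regression suite; True only if every test passed."""
    os.environ["PROMPT_UNDER_TEST"] = f"{name}:{version}"  # hypothetical fixture hook
    return pytest.main(["tests/test_prompts.py", "-q"]) == 0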
Git-Based Prompt Management
File-First

# prompts/ directory — treat prompts like source files
#
# prompts/
# ├── rag_system.txt       ← current version
# ├── rag_system.v1.txt    ← archived version
# ├── chat_system.txt
# └── agent_system.txt

from pathlib import Path
import hashlib

PROMPT_DIR = Path("prompts")

def load_prompt(name: str) -> str:
    """Load prompt from file. Falls back to DB if file not found."""
    path = PROMPT_DIR / f"{name}.txt"
    if path.exists():
        return path.read_text(encoding="utf-8")
    # Fall back to DB
    p = get_active_prompt(name)
    return p["content"] if p else ""

def prompt_changed(name: str) -> bool:
    """Detect if the file version differs from the DB active version."""
    file_content = load_prompt(name)
    db_version = get_active_prompt(name)
    if not db_version:
        return True
    file_hash = hashlib.sha256(file_content.encode()).hexdigest()[:12]
    return file_hash != db_version["hash"]

# CI/CD hook: on prompt file change, require test pass before merge
# .github/workflows/test-prompts.yml
# jobs:
#   test-prompts:
#     steps:
#       - run: python -m pytest tests/test_prompts.py -v
#       - run: python scripts/sync_prompts_to_db.py   # only if tests pass
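The CI hook ends with scripts/sync_prompts_to_db.py, which is not shown. A minimal sketch, assuming the registry functions from the previous tab are importable; it registers and activates any prompt file whose content no longer matches the active DB version:

# scripts/sync_prompts_to_db.py (sketch): CI runs this only after tests pass
from pathlib import Path

for path in Path("prompts").glob("*.txt"):
    name = path.stem
    if "." in name:   # skip archived copies like rag_system.v1.txt
        continue
    if prompt_changed(name):
        v = register_prompt(name, path.read_text(encoding="utf-8"),
                            author="ci", notes="synced from Git")
        activate_prompt(name, v)
        print(f"Synced {name} → v{v}")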
Prompt Regression Testing
Test Before Deploy

import pytest, anthropic

client = anthropic.Anthropic()

def call_with_prompt(prompt_content: str, user_message: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        temperature=0,  # deterministic — required for the exact-match assertions below
        system=prompt_content,
        messages=[{"role": "user", "content": user_message}]
    )
    return response.content[0].text
# ── Deterministic assertions (temperature=0) ──────────
# These must pass for every prompt version before activation
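# NOTE: load_prompt_version (used just below) is not defined in this module.
# A minimal sketch against the prompts.db schema from the registry section:
def load_prompt_version(name: str, version: int) -> str:
    import sqlite3
    with sqlite3.connect("prompts.db") as conn:
        row = conn.execute(
            "SELECT content FROM prompts WHERE name=? AND version=?",
            (name, version)).fetchone()
    return row[0] if row else ""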
RAG_PROMPT_V2 = load_prompt_version("rag_system", version=2)
def test_rag_stays_grounded():
    """Prompt must refuse to answer from outside context."""
    reply = call_with_prompt(RAG_PROMPT_V2,
        "What is the capital of France? (Context: [empty])")
    forbidden = ["Paris", "France"]
    for word in forbidden:
        assert word not in reply, f"Hallucinated '{word}' outside context"
def test_rag_uses_context():
    """Prompt must use provided context."""
    ctx = "The DPDK mempool is initialised with rte_mempool_create()."
    reply = call_with_prompt(RAG_PROMPT_V2,
        f"Context: {ctx}\n\nHow is DPDK mempool initialised?")
    assert "rte_mempool_create" in reply
def test_rag_declines_gracefully():
    """Prompt must decline to answer when the context is empty."""
    reply = call_with_prompt(RAG_PROMPT_V2,
        "Context: [no documents retrieved]\n\nWhat is VPP?")
    assert "don't have" in reply.lower() or "not contain" in reply.lower()
# ── LLM-as-judge tests (non-deterministic behaviour) ──
from eval_helpers import judge_faithfulness
def test_rag_faithfulness_score():
    """Faithfulness must be >= 0.85 on held-out test set."""
    scores = []
    for case in HELD_OUT_TEST_CASES:
        reply = call_with_prompt(RAG_PROMPT_V2, case["prompt"])
        v = judge_faithfulness(case["context"], reply)
        scores.append(v.score)
    avg = sum(scores) / len(scores)
    assert avg >= 0.85, f"Faithfulness {avg:.3f} < 0.85 threshold"
# Run: pytest tests/test_prompts.py -v
# If tests pass: activate_prompt("rag_system", 2)
# If tests fail: do NOT activate — investigate and fix the prompt

Cost Monitoring — Know Where Every Dollar Goes
Financial Control

import sqlite3
from datetime import datetime, timedelta
# (input, output) price in USD per token; e.g. Sonnet is $3 / $15 per million
MODEL_PRICES = {
    "claude-3-5-sonnet-20241022": (3.0/1e6, 15.0/1e6),
    "claude-3-haiku-20240307": (0.25/1e6, 1.25/1e6),
    "gpt-4o": (2.5/1e6, 10.0/1e6),
}
def init_cost_db():
    with sqlite3.connect("costs.db") as conn:
        conn.execute("""CREATE TABLE IF NOT EXISTS llm_calls (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            ts TEXT NOT NULL,
            model TEXT NOT NULL,
            endpoint TEXT NOT NULL,
            user_id TEXT NOT NULL,
            input_tok INTEGER, output_tok INTEGER,
            cost_usd REAL,
            latency_ms REAL,
            cached INTEGER DEFAULT 0)""")
        conn.executescript("""
            CREATE INDEX IF NOT EXISTS idx_ts ON llm_calls(ts);
            CREATE INDEX IF NOT EXISTS idx_user ON llm_calls(user_id);
            CREATE INDEX IF NOT EXISTS idx_model ON llm_calls(model);
        """)
def log_llm_call(model: str, endpoint: str, user_id: str,
                 input_tok: int, output_tok: int, latency_ms: float,
                 cached: bool = False):
    # For cache hits (cached=True), cost_usd records the spend that was
    # avoided; cost_report() sums these rows as cache_savings.
    p_in, p_out = MODEL_PRICES.get(model, (3e-6, 15e-6))
    cost = input_tok * p_in + output_tok * p_out
    with sqlite3.connect("costs.db") as conn:
        conn.execute("""INSERT INTO llm_calls
            (ts,model,endpoint,user_id,input_tok,output_tok,cost_usd,latency_ms,cached)
            VALUES (?,?,?,?,?,?,?,?,?)""",
            (datetime.utcnow().isoformat(), model, endpoint, user_id,
             input_tok, output_tok, cost, latency_ms, int(cached)))
# ── Reporting queries ─────────────────────────────────
def cost_report(days: int = 30) -> dict:
    cutoff = (datetime.utcnow() - timedelta(days=days)).isoformat()
    with sqlite3.connect("costs.db") as conn:
        total = conn.execute(
            "SELECT SUM(cost_usd), SUM(input_tok+output_tok), COUNT(*) FROM llm_calls WHERE ts>?",
            (cutoff,)).fetchone()
        by_model = conn.execute(
            "SELECT model, SUM(cost_usd), COUNT(*) FROM llm_calls WHERE ts>? GROUP BY model ORDER BY SUM(cost_usd) DESC",
            (cutoff,)).fetchall()
        by_user = conn.execute(
            "SELECT user_id, SUM(cost_usd) FROM llm_calls WHERE ts>? GROUP BY user_id ORDER BY SUM(cost_usd) DESC LIMIT 10",
            (cutoff,)).fetchall()
        cache_savings = conn.execute(
            "SELECT SUM(cost_usd) FROM llm_calls WHERE ts>? AND cached=1",
            (cutoff,)).fetchone()[0] or 0
    return {
        "period_days": days,
        "total_usd": round(total[0] or 0, 4),
        "total_tokens": total[1] or 0,
        "total_calls": total[2] or 0,
        "cache_savings": round(cache_savings, 4),
        "by_model": [{"model": r[0], "cost": round(r[1], 4), "calls": r[2]} for r in by_model],
        "top_users": [{"user_id": r[0], "cost": round(r[1], 4)} for r in by_user],
    }

Response Caching — Eliminate Redundant API Calls
Cost + Speed

import redis, hashlib, json
from typing import Optional

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# ── Exact match cache ─────────────────────────────────
# Same prompt + same system → same deterministic response
# Only valid for temperature=0 calls
def cache_key(system: str, messages: list, model: str) -> str:
    payload = json.dumps({"system": system, "messages": messages,
                          "model": model}, sort_keys=True)
    return f"llm:resp:{hashlib.md5(payload.encode()).hexdigest()}"

def get_cached(system: str, messages: list, model: str) -> Optional[str]:
    """Check cache. Returns cached response or None."""
    return r.get(cache_key(system, messages, model))

def set_cached(system: str, messages: list, model: str,
               response: str, ttl_seconds: int = 3600):
    r.setex(cache_key(system, messages, model), ttl_seconds, response)

async def cached_llm_call(system: str, messages: list,
                          model: str = "claude-3-5-sonnet-20241022",
                          temperature: float = 0.0) -> tuple[str, bool]:
    """Returns (response_text, was_cached)."""
    if temperature == 0.0:  # only cache deterministic calls
        cached = get_cached(system, messages, model)
        if cached:
            return cached, True
    # llm_client: e.g. anthropic.AsyncAnthropic() created at module scope
    response = await llm_client.messages.create(
        model=model, max_tokens=1024, temperature=temperature,
        system=system, messages=messages
    )
    text = response.content[0].text
    if temperature == 0.0:
        set_cached(system, messages, model, text)
    return text, False

# ── Semantic cache — cache similar (not just identical) queries ──
# 1. Embed the query
# 2. Search cached embeddings for cosine similarity > threshold
# 3. Return cached response if similar enough
import numpy as np

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95, ttl: int = 3600):
        self.threshold = similarity_threshold
        self.ttl = ttl  # not enforced by this in-memory sketch
        self._entries: list[dict] = []  # in-prod: use a vector DB

    def _cosine_sim(self, a, b) -> float:
        a, b = np.array(a), np.array(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def get(self, query_embedding: list[float]) -> Optional[str]:
        for entry in self._entries:
            if self._cosine_sim(query_embedding, entry["embedding"]) >= self.threshold:
                return entry["response"]
        return None

    def set(self, query_embedding: list[float], response: str):
        self._entries.append({"embedding": query_embedding, "response": response})
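A usage sketch for SemanticCache; embed() and call_llm() are hypothetical stand-ins for whatever embedding and LLM calls your stack already uses:

sem_cache = SemanticCache(similarity_threshold=0.95)

def answer(query: str) -> str:
    vec = embed(query)             # hypothetical: returns list[float]
    hit = sem_cache.get(vec)
    if hit is not None:
        return hit                 # a semantically similar query was answered before
    response = call_llm(query)     # hypothetical LLM call
    sem_cache.set(vec, response)
    return response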
💡 Cache hit rate is a key business metric. Even a 20% cache hit rate on RAG queries means 20% fewer LLM API calls — directly reducing cost and latency. Track cache_savings in your cost report (see the cost monitoring section above) to show the value of caching to stakeholders.
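That hit rate falls straight out of the costs.db schema above; a minimal sketch:

import sqlite3
from datetime import datetime, timedelta

def cache_hit_rate(days: int = 30) -> float:
    """Fraction of logged LLM calls served from cache."""
    cutoff = (datetime.utcnow() - timedelta(days=days)).isoformat()
    with sqlite3.connect("costs.db") as conn:
        hits, total = conn.execute(
            "SELECT SUM(cached), COUNT(*) FROM llm_calls WHERE ts>?",
            (cutoff,)).fetchone()
    return (hits or 0) / total if total else 0.0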
Anthropic Prompt Caching — 90% Cost Reduction
Provider Feature

Anthropic's prompt caching caches the KV computation for large system prompts and documents. When the same cached prefix is sent again within 5 minutes, you pay 90% less for those tokens.
import anthropic

client = anthropic.Anthropic()

# ── Cache a large system prompt ───────────────────────
# Use when: same large system prompt sent with every request
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": very_long_system_prompt,  # must be > 1024 tokens for caching to apply
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": user_question}]
)
# First call: cache_creation_input_tokens = N (1.25× normal input price)
# Subsequent calls within 5 min: cache_read_input_tokens = N (10% price)
print(f"Cache write: {response.usage.cache_creation_input_tokens}")
print(f"Cache read:  {response.usage.cache_read_input_tokens}")

# ── Cache a large document for RAG ────────────────────
# Use when: same large document referenced in many queries
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system="You are a document Q&A assistant.",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Here is the DPDK programmer's guide:"},
            {"type": "text", "text": large_dpdk_document,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": user_question}
        ]
    }]
)

# ── When prompt caching is worth it ───────────────────
# Cache writes cost 1.25× the normal input price; cache reads cost 0.1×.
# The extra 0.25× write premium is recovered by the very first cache read
# (each read saves 0.9×), so any prefix reused at least once within the
# 5-minute window comes out ahead. 10+ uses per 5 min → always worth it.
#
# Best use cases:
# - Long system prompts (>2k tokens) sent with every request
# - Large documents referenced in many RAG queries
# - Few-shot examples in prompts
# - Tool definitions for agents with many tools
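To put numbers on the savings (Lab 3 below), you can price a single response's usage block. A minimal sketch, assuming Sonnet's $3/M base input price and the 1.25× write / 0.1× read multipliers above; output tokens are left out because caching only changes input pricing:

def prompt_cache_cost(usage, base_input_price: float = 3.0 / 1e6) -> dict:
    """Price the input side of one response, split by cache status."""
    actual = (usage.input_tokens * base_input_price
              + usage.cache_creation_input_tokens * base_input_price * 1.25
              + usage.cache_read_input_tokens * base_input_price * 0.10)
    # what the same tokens would cost with caching disabled
    no_cache = (usage.input_tokens + usage.cache_creation_input_tokens
                + usage.cache_read_input_tokens) * base_input_price
    return {"actual_usd": actual, "no_cache_usd": no_cache,
            "saved_usd": no_cache - actual}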
FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Docs | Anthropic: Prompt Caching — docs.anthropic.com | Official guide on prompt caching. Covers supported models, cache lifetime, and pricing. |
| Tool | Promptfoo — github.com/promptfoo/promptfoo | Open-source prompt testing framework. CI/CD integration, regression tests, red-teaming. |
| Article | Prompt Versioning in Production — hamel.dev | Battle-tested strategies for managing prompts in production ML systems. |
| Docs | Redis TTL and Expiry — redis.io/docs | Redis TTL mechanics for response cache expiry and keyspace events. |
Build a complete prompt management and cost monitoring system for your AI API.
Requirements
- Prompt registry — SQLite-backed register, activate, rollback, history endpoints in FastAPI
- Prompt regression tests — pytest suite: grounding test, context-use test, graceful-decline test
- Cost logger — log every LLM call to costs.db with model, endpoint, user, tokens, cost
- Cost report API — GET /admin/costs returns 30-day report: total, by model, top users, cache savings (see the sketch below)
- Response cache — Redis-backed exact match for temperature=0 calls, 1-hour TTL
- Prompt caching — apply cache_control to your RAG system prompt; log cache_read vs cache_write tokens
Skills: SQLite versioning, pytest fixtures, Redis caching, cost analytics, Anthropic prompt caching
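The cost report endpoint can be wired up by exposing cost_report() through FastAPI; a minimal sketch, where the costs module path is hypothetical and admin auth is omitted:

from fastapi import FastAPI

from costs import cost_report  # hypothetical module holding the cost code above

app = FastAPI()

@app.get("/admin/costs")
def admin_costs(days: int = 30):
    """Spend report: total, by model, top users, cache savings."""
    return cost_report(days)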
Prompt Versioning Lifecycle
Objective: Practise the full register → test → activate → monitor → rollback cycle.
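One possible run of that cycle, using the registry functions from this module and the run_regression_tests sketch from earlier; the prompt strings are placeholders:

init_prompt_db()
v1 = register_prompt("rag_system", "You are a helpful assistant...", author="you")
activate_prompt("rag_system", v1)

v2 = register_prompt("rag_system", "You are a precise assistant...",
                     notes="tighten grounding language")
if run_regression_tests("rag_system", v2):
    activate_prompt("rag_system", v2)

# monitor; if quality metrics regress, roll back instantly:
rollback_prompt("rag_system", v1)
for entry in list_prompt_history("rag_system"):
    print(entry)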
Cost Report — Find Your Biggest Spend
Objective: Instrument 100 real API calls and use the cost report to identify optimisation opportunities.
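Glue for the instrumentation step; a sketch that wraps one Anthropic call, times it, and logs token counts from response.usage through log_llm_call() from the cost monitoring section:

import time
import anthropic

client = anthropic.Anthropic()

def tracked_call(endpoint: str, user_id: str, system: str, user_msg: str) -> str:
    t0 = time.perf_counter()
    resp = client.messages.create(
        model="claude-3-5-sonnet-20241022", max_tokens=512,
        system=system, messages=[{"role": "user", "content": user_msg}])
    log_llm_call(model="claude-3-5-sonnet-20241022", endpoint=endpoint,
                 user_id=user_id, input_tok=resp.usage.input_tokens,
                 output_tok=resp.usage.output_tokens,
                 latency_ms=(time.perf_counter() - t0) * 1000)
    return resp.content[0].text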
Prompt Caching — Measure the Savings
Objective: Add Anthropic prompt caching and measure the real cost reduction.
P7-M26 MASTERY CHECKLIST
- Treat prompts as code: every production prompt is versioned, reviewed, and tested before activation
- Can implement a SQLite-backed prompt registry with register, activate, rollback, and history functions
- Can detect prompt drift: compare file hash to active DB version with prompt_changed()
- Can write at least 3 prompt regression tests: grounding, context-use, graceful-decline
- Know the deployment workflow: register → test → activate → monitor → rollback if needed
- Can log every LLM call to SQLite: model, endpoint, user_id, tokens, cost, cached flag
- Can generate a cost report broken down by model, endpoint, and top users
- Can implement Redis-backed exact match response cache for temperature=0 calls
- Know that semantic cache requires embedding similarity above a threshold (0.95 is a good starting point)
- Know that prompt caching requires >1024 tokens in the cached block to be eligible
- Can apply cache_control: ephemeral to system prompt and documents in Anthropic API calls
- Know to log cache_creation_input_tokens vs cache_read_input_tokens separately for cost tracking
- Know prompt caching TTL is 5 minutes — repeated calls must arrive within 5 min to hit the cache
- Completed Lab 1: prompt versioning lifecycle with regression test failure and fix
- Completed Lab 2: cost report analysis with model switching and caching impact
- Completed Lab 3: Anthropic prompt caching measured with savings calculation
- Milestone project: prompt management system + cost dashboard pushed to GitHub
✅ When complete: Move to P7-M27 — MLOps Foundations. The final Part 7 module covers CI/CD for AI, model versioning, and the operational patterns needed for long-running AI products.