Part 7 — Production & Deployment  ·  Module 26 of 27
Prompt Versioning, Cost Monitoring & Caching
Manage prompt changes safely, track spend, and eliminate redundant LLM calls
⏱ 1 Week 🟡 Intermediate 🔧 Git · Redis · Promptfoo · SQLite 📋 Prerequisite: P7-M25
🎯

What This Module Covers

AI-Specific Ops

The operational challenges unique to AI systems: prompts silently change behaviour when edited, LLM costs accumulate invisibly, and identical queries hit the API repeatedly. This module gives you systems for each problem.

  • Prompt versioning — storing prompts in DB/Git, tracking changes, rollback on regression
  • Prompt testing — regression testing before deploying a changed prompt
  • Cost monitoring — per-user, per-endpoint, per-model spend dashboards
  • Response caching — semantic deduplication, Redis TTL cache for identical queries
  • Anthropic prompt caching — 90% cost reduction on large repeated system prompts
📝

Prompt Versioning — Never Lose a Working Prompt

Version Control

A prompt is code. Like code, it should be versioned, reviewed, and tested before deployment. A casual edit to a production system prompt can break behaviour for every user — silently.

import sqlite3, hashlib
from datetime import datetime
from typing import Optional

# ── DB-backed prompt registry ─────────────────────────
def init_prompt_db():
    with sqlite3.connect("prompts.db") as conn:
        conn.execute("""CREATE TABLE IF NOT EXISTS prompts (
            id          INTEGER PRIMARY KEY AUTOINCREMENT,
            name        TEXT NOT NULL,
            version     INTEGER NOT NULL,
            content     TEXT NOT NULL,
            hash        TEXT NOT NULL,
            author      TEXT,
            notes       TEXT,
            is_active   INTEGER DEFAULT 0,
            created_at  TEXT NOT NULL,
            UNIQUE(name, version))""")
        conn.execute("CREATE INDEX IF NOT EXISTS idx_name ON prompts(name, is_active)")

def register_prompt(name: str, content: str, author: str = "", notes: str = "") -> int:
    """Register a new version of a prompt. Returns version number."""
    h   = hashlib.sha256(content.encode()).hexdigest()[:12]
    now = datetime.utcnow().isoformat()
    with sqlite3.connect("prompts.db") as conn:
        row = conn.execute(
            "SELECT MAX(version) FROM prompts WHERE name=?", (name,)).fetchone()
        version = (row[0] or 0) + 1
        conn.execute("""INSERT INTO prompts (name,version,content,hash,author,notes,created_at)
            VALUES (?,?,?,?,?,?,?)""", (name, version, content, h, author, notes, now))
    return version

def activate_prompt(name: str, version: int):
    """Activate a specific version — all others for this name become inactive."""
    with sqlite3.connect("prompts.db") as conn:
        conn.execute("UPDATE prompts SET is_active=0 WHERE name=?", (name,))
        conn.execute(
            "UPDATE prompts SET is_active=1 WHERE name=? AND version=?",
            (name, version))

def get_active_prompt(name: str) -> Optional[dict]:
    with sqlite3.connect("prompts.db") as conn:
        row = conn.execute(
            "SELECT content, version, hash FROM prompts WHERE name=? AND is_active=1",
            (name,)).fetchone()
    if not row:
        return None
    return {"content": row[0], "version": row[1], "hash": row[2]}

def rollback_prompt(name: str, to_version: int):
    """Rollback to a previous version."""
    activate_prompt(name, to_version)
    print(f"Rolled back {name!r} to version {to_version}")

def list_prompt_history(name: str) -> list[dict]:
    with sqlite3.connect("prompts.db") as conn:
        rows = conn.execute("""SELECT version, hash, author, is_active, created_at, notes
            FROM prompts WHERE name=? ORDER BY version DESC""", (name,)).fetchall()
    return [{"version": r[0], "hash": r[1], "author": r[2],
             "active": bool(r[3]), "created": r[4], "notes": r[5]} for r in rows]

# Usage workflow:
# v1 = register_prompt("rag_system", "You are a helpful assistant...")     → version 1
# activate_prompt("rag_system", 1)                                          → live
# v2 = register_prompt("rag_system", "You are a precise assistant...")     → version 2
# run_regression_tests("rag_system", v2)  ← test BEFORE activating
# activate_prompt("rag_system", 2)                                          → live
# if metrics worsen: rollback_prompt("rag_system", 1)                      → instant
📁

Git-Based Prompt Management

File-First
# prompts/ directory — treat prompts like source files
#
# prompts/
# ├── rag_system.txt           ← current version
# ├── rag_system.v1.txt        ← archived version
# ├── chat_system.txt
# └── agent_system.txt

from pathlib import Path
import hashlib

PROMPT_DIR = Path("prompts")

def load_prompt(name: str) -> str:
    """Load prompt from file. Falls back to DB if file not found."""
    path = PROMPT_DIR / f"{name}.txt"
    if path.exists():
        return path.read_text(encoding="utf-8")
    # Fall back to DB
    p = get_active_prompt(name)
    return p["content"] if p else ""

def prompt_changed(name: str) -> bool:
    """Detect if the file version differs from the DB active version."""
    file_content = load_prompt(name)
    db_version   = get_active_prompt(name)
    if not db_version:
        return True
    file_hash = hashlib.sha256(file_content.encode()).hexdigest()[:12]
    return file_hash != db_version["hash"]

# CI/CD hook: on prompt file change, require test pass before merge
# .github/workflows/test-prompts.yml
# jobs:
#   test-prompts:
#     steps:
#       - run: python -m pytest tests/test_prompts.py -v
#       - run: python scripts/sync_prompts_to_db.py  # only if tests pass
🧪

Prompt Regression Testing

Test Before Deploy
import sqlite3, pytest, anthropic

client = anthropic.Anthropic()

def call_with_prompt(prompt_content: str, user_message: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        temperature=0,   # deterministic-as-possible, required for exact-match assertions
        system=prompt_content,
        messages=[{"role": "user", "content": user_message}]
    )
    return response.content[0].text

# ── Deterministic assertions (temperature=0) ──────────
# These must pass for every prompt version before activation

# Helper: fetch a specific (possibly inactive) version from the registry
def load_prompt_version(name: str, version: int) -> str:
    with sqlite3.connect("prompts.db") as conn:
        row = conn.execute(
            "SELECT content FROM prompts WHERE name=? AND version=?",
            (name, version)).fetchone()
    return row[0] if row else ""

RAG_PROMPT_V2 = load_prompt_version("rag_system", version=2)

def test_rag_stays_grounded():
    """Prompt must refuse to answer from outside context."""
    reply = call_with_prompt(RAG_PROMPT_V2,
                             "What is the capital of France? (Context: [empty])")
    # "France" may legitimately appear in a refusal ("...no context about France"),
    # so only the hallucinated answer itself is forbidden
    assert "Paris" not in reply, "Hallucinated 'Paris' outside context"

def test_rag_uses_context():
    """Prompt must use provided context."""
    ctx = "The DPDK mempool is initialised with rte_mempool_create()."
    reply = call_with_prompt(RAG_PROMPT_V2,
                             f"Context: {ctx}\n\nHow is DPDK mempool initialised?")
    assert "rte_mempool_create" in reply

def test_rag_declines_gracefully():
    """Prompt must produce the exact 'I don't know' phrase when context empty."""
    reply = call_with_prompt(RAG_PROMPT_V2,
                             "Context: [no documents retrieved]\n\nWhat is VPP?")
    assert "don't have" in reply.lower() or "not contain" in reply.lower()

# ── LLM-as-judge tests (non-deterministic behaviour) ──
from eval_helpers import judge_faithfulness

def test_rag_faithfulness_score():
    """Faithfulness must be >= 0.85 on held-out test set."""
    scores = []
    for case in HELD_OUT_TEST_CASES:   # {"prompt", "context"} dicts loaded from a fixture file
        reply = call_with_prompt(RAG_PROMPT_V2, case["prompt"])
        v = judge_faithfulness(case["context"], reply)
        scores.append(v.score)
    avg = sum(scores) / len(scores)
    assert avg >= 0.85, f"Faithfulness {avg:.3f} < 0.85 threshold"

# Run: pytest tests/test_prompts.py -v
# If tests pass: activate_prompt("rag_system", 2)
# If tests fail: do NOT activate — investigate and fix prompt
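A small gate script can encode the test-before-activate rule. A sketch, assuming the registry functions from earlier live in a prompt_registry module and the pytest suite above exists; the script name and flow are illustrative:

# scripts/deploy_prompt.py: hypothetical test-then-activate gate (sketch)
import subprocess, sys

from prompt_registry import activate_prompt   # assumed module for the registry functions

def deploy(name: str, version: int) -> bool:
    """Activate a prompt version only if the regression suite passes."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "tests/test_prompts.py", "-v"])
    if result.returncode != 0:
        print(f"Tests FAILED: {name} v{version} NOT activated")
        return False
    activate_prompt(name, version)
    print(f"Tests passed: {name} v{version} is now live")
    return True

# deploy("rag_system", 2)   activates only on a green suite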
💰

Cost Monitoring — Know Where Every Dollar Goes

Financial Control
import sqlite3
from datetime import datetime, timedelta

MODEL_PRICES = {
    "claude-3-5-sonnet-20241022": (3.0/1e6, 15.0/1e6),
    "claude-3-haiku-20240307":    (0.25/1e6, 1.25/1e6),
    "gpt-4o":                     (2.5/1e6, 10.0/1e6),
}

def init_cost_db():
    with sqlite3.connect("costs.db") as conn:
        conn.execute("""CREATE TABLE IF NOT EXISTS llm_calls (
            id          INTEGER PRIMARY KEY AUTOINCREMENT,
            ts          TEXT NOT NULL,
            model       TEXT NOT NULL,
            endpoint    TEXT NOT NULL,
            user_id     TEXT NOT NULL,
            input_tok   INTEGER, output_tok INTEGER,
            cost_usd    REAL,
            latency_ms  REAL,
            cached      INTEGER DEFAULT 0)""")
        conn.executescript("""
            CREATE INDEX IF NOT EXISTS idx_ts      ON llm_calls(ts);
            CREATE INDEX IF NOT EXISTS idx_user    ON llm_calls(user_id);
            CREATE INDEX IF NOT EXISTS idx_model   ON llm_calls(model);
        """)

def log_llm_call(model: str, endpoint: str, user_id: str,
                 input_tok: int, output_tok: int, latency_ms: float,
                 cached: bool = False):
    """Log one call. For cache hits (cached=True), cost_usd records the cost
    that was AVOIDED; cost_report() excludes those rows from total spend."""
    p_in, p_out = MODEL_PRICES.get(model, (3e-6, 15e-6))
    cost = input_tok * p_in + output_tok * p_out
    with sqlite3.connect("costs.db") as conn:
        conn.execute("""INSERT INTO llm_calls
            (ts,model,endpoint,user_id,input_tok,output_tok,cost_usd,latency_ms,cached)
            VALUES (?,?,?,?,?,?,?,?,?)""",
            (datetime.utcnow().isoformat(), model, endpoint, user_id,
             input_tok, output_tok, cost, latency_ms, int(cached)))
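
# A sketch of wiring the logger into a real call. Assumes a synchronous
# anthropic.Anthropic() client; the usage token counts are fields the SDK
# returns on every response.
import time
import anthropic

_client = anthropic.Anthropic()

def tracked_call(endpoint: str, user_id: str, system: str, messages: list,
                 model: str = "claude-3-5-sonnet-20241022") -> str:
    t0 = time.perf_counter()
    response = _client.messages.create(
        model=model, max_tokens=1024, system=system, messages=messages)
    latency_ms = (time.perf_counter() - t0) * 1000
    log_llm_call(model, endpoint, user_id,
                 response.usage.input_tokens, response.usage.output_tokens,
                 latency_ms)
    return response.content[0].text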

# ── Reporting queries ─────────────────────────────────
def cost_report(days: int = 30) -> dict:
    cutoff = (datetime.utcnow() - timedelta(days=days)).isoformat()
    with sqlite3.connect("costs.db") as conn:
        total = conn.execute(
            "SELECT SUM(cost_usd), SUM(input_tok+output_tok), COUNT(*) "
            "FROM llm_calls WHERE ts>? AND cached=0",   # cached rows are savings, not spend
            (cutoff,)).fetchone()
        by_model = conn.execute(
            "SELECT model, SUM(cost_usd), COUNT(*) FROM llm_calls WHERE ts>? GROUP BY model ORDER BY SUM(cost_usd) DESC",
            (cutoff,)).fetchall()
        by_user = conn.execute(
            "SELECT user_id, SUM(cost_usd) FROM llm_calls WHERE ts>? GROUP BY user_id ORDER BY SUM(cost_usd) DESC LIMIT 10",
            (cutoff,)).fetchall()
        cache_savings = conn.execute(
            "SELECT SUM(cost_usd) FROM llm_calls WHERE ts>? AND cached=1",
            (cutoff,)).fetchone()[0] or 0
    return {
        "period_days":    days,
        "total_usd":      round(total[0] or 0, 4),
        "total_tokens":   total[1] or 0,
        "total_calls":    total[2] or 0,
        "cache_savings":  round(cache_savings, 4),
        "by_model":       [{"model": r[0], "cost": round(r[1], 4), "calls": r[2]} for r in by_model],
        "top_users":      [{"user_id": r[0], "cost": round(r[1], 4)} for r in by_user],
    }
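A daily budget alert falls out of the same table. A minimal sketch; the threshold and the print-based alert are placeholders for your real alerting channel:

import sqlite3
from datetime import datetime, timedelta

DAILY_BUDGET_USD = 25.0   # illustrative threshold, tune to your product

def check_daily_budget() -> float:
    """Return the last-24h spend and warn if it exceeds the budget."""
    cutoff = (datetime.utcnow() - timedelta(days=1)).isoformat()
    with sqlite3.connect("costs.db") as conn:
        spend = conn.execute(
            "SELECT COALESCE(SUM(cost_usd), 0) FROM llm_calls WHERE ts>? AND cached=0",
            (cutoff,)).fetchone()[0]
    if spend > DAILY_BUDGET_USD:
        print(f"ALERT: daily LLM spend ${spend:.2f} exceeds ${DAILY_BUDGET_USD:.2f} budget")
    return spend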

Response Caching — Eliminate Redundant API Calls

Cost + Speed
import redis, hashlib, json
import anthropic
from typing import Optional

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
llm_client = anthropic.AsyncAnthropic()   # async client used by cached_llm_call below

# ── Exact match cache ─────────────────────────────────
# Same system + messages + model → reuse the stored response
# Only safe for temperature=0 calls, where output is (near-)deterministic

def cache_key(system: str, messages: list, model: str) -> str:
    payload = json.dumps({"system": system, "messages": messages,
                          "model": model}, sort_keys=True)
    return f"llm:resp:{hashlib.md5(payload.encode()).hexdigest()}"

def get_cached(system: str, messages: list, model: str) -> Optional[str]:
    """Check cache. Returns cached response or None."""
    key = cache_key(system, messages, model)
    return r.get(key)

def set_cached(system: str, messages: list, model: str,
               response: str, ttl_seconds: int = 3600):
    key = cache_key(system, messages, model)
    r.setex(key, ttl_seconds, response)

async def cached_llm_call(system: str, messages: list,
                           model: str = "claude-3-5-sonnet-20241022",
                           temperature: float = 0.0) -> tuple[str, bool]:
    """Returns (response_text, was_cached)."""
    if temperature == 0.0:   # only cache deterministic calls
        cached = get_cached(system, messages, model)
        if cached:
            return cached, True

    response = await llm_client.messages.create(
        model=model, max_tokens=1024, temperature=temperature,
        system=system, messages=messages
    )
    text = response.content[0].text

    if temperature == 0.0:
        set_cached(system, messages, model, text)

    return text, False

# ── Semantic cache — cache similar (not just identical) queries ──
# 1. Embed the query
# 2. Search cached embeddings for cosine similarity > threshold
# 3. Return cached response if similar enough

import numpy as np

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95, ttl: int = 3600):
        self.threshold = similarity_threshold
        self.ttl = ttl                   # note: expiry is not enforced in this in-memory sketch
        self._entries: list[dict] = []   # in prod: use a vector DB with TTL support

    def _cosine_sim(self, a, b) -> float:
        a, b = np.array(a), np.array(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def get(self, query_embedding: list[float]) -> Optional[str]:
        for entry in self._entries:
            sim = self._cosine_sim(query_embedding, entry["embedding"])
            if sim >= self.threshold:
                return entry["response"]
        return None

    def set(self, query_embedding: list[float], response: str):
        self._entries.append({"embedding": query_embedding, "response": response})

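A usage sketch for the class above; embed_query() and run_rag_pipeline() are placeholders for your embedding client and RAG pipeline, not real APIs:

# Hypothetical wiring (sketch)
cache = SemanticCache(similarity_threshold=0.95)

def semantic_cached_answer(query: str) -> str:
    emb = embed_query(query)          # placeholder: returns list[float]
    hit = cache.get(emb)
    if hit is not None:
        return hit                    # a similar-enough query was answered before
    answer = run_rag_pipeline(query)  # placeholder: your RAG pipeline
    cache.set(emb, answer)
    return answer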
💡 Cache hit rate is a key business metric. Even a 20% cache hit rate on RAG queries means 20% fewer LLM API calls — directly reducing cost and latency. Track cache_savings in your cost report (from the cost monitoring section above) to show the value of caching to stakeholders.
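Computing the hit rate itself is one query against the costs.db schema above. A sketch:

import sqlite3
from datetime import datetime, timedelta

def cache_hit_rate(days: int = 7) -> float:
    """Fraction of logged LLM calls served from cache over the window."""
    cutoff = (datetime.utcnow() - timedelta(days=days)).isoformat()
    with sqlite3.connect("costs.db") as conn:
        hits, total = conn.execute(
            "SELECT COALESCE(SUM(cached), 0), COUNT(*) FROM llm_calls WHERE ts>?",
            (cutoff,)).fetchone()
    return hits / total if total else 0.0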

🔄

Anthropic Prompt Caching — 90% Cost Reduction

Provider Feature

Anthropic's prompt caching stores the KV (attention key/value) computation for large system prompts and documents. When the same cached prefix is sent again within 5 minutes (the TTL refreshes on every cache hit), you pay 90% less for those tokens.

import anthropic
client = anthropic.Anthropic()

# ── Cache a large system prompt ───────────────────────
# Use when: same large system prompt sent with every request
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": very_long_system_prompt,   # must be > 1024 tokens for caching to apply
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": user_question}]
)

# First call: cache_creation_input_tokens = N (full price)
# Subsequent calls within 5 min: cache_read_input_tokens = N (10% price)
print(f"Cache write: {response.usage.cache_creation_input_tokens}")
print(f"Cache read:  {response.usage.cache_read_input_tokens}")

# ── Cache a large document for RAG ────────────────────
# Use when: same large document referenced in many queries
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system="You are a document Q&A assistant.",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Here is the DPDK programmer's guide:"},
            {"type": "text", "text": large_dpdk_document,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": user_question}
        ]
    }]
)

# ── When prompt caching is worth it ───────────────────
# Cache writes cost 1.25× the normal input price; cache reads cost 0.1×.
# Break-even at the FIRST cache read: 1.25 + 0.1 = 1.35× vs 2.0× for two
# uncached calls. Every further read inside the window saves another 0.9×.
#
# Best use cases:
# - Long system prompts (>2k tokens) sent with every request
# - Large documents referenced in many RAG queries
# - Few-shot examples in prompts
# - Tool definitions for agents with many tools
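
# The arithmetic for a 20-call run, using the multipliers above. A worked
# sketch; the 2,000-token prompt size and $3/MTok price are illustrative.
base     = 2000 * 3.0 / 1e6                # $0.0060 per uncached send of the prompt
uncached = 20 * base                       # $0.1200
cached   = 1.25 * base + 19 * 0.1 * base   # 1 write + 19 reads = $0.0189
print(f"savings: {(1 - cached / uncached) * 100:.0f}%")   # → ~84%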

FREE LEARNING RESOURCES

  • Docs — Anthropic: Prompt Caching (docs.anthropic.com): official guide on prompt caching, covering supported models, cache lifetime, and pricing.
  • Tool — Promptfoo (github.com/promptfoo/promptfoo): open-source prompt testing framework with CI/CD integration, regression tests, and red-teaming.
  • Article — Prompt Versioning in Production (hamel.dev): battle-tested strategies for managing prompts in production ML systems.
  • Docs — Redis TTL and Expiry (redis.io/docs): Redis TTL mechanics for response cache expiry and keyspace events.
🛠 Prompt Management System + Cost Dashboard [Intermediate] 3–4 days

Build a complete prompt management and cost monitoring system for your AI API.

Requirements

  • Prompt registry — SQLite-backed register, activate, rollback, history endpoints in FastAPI
  • Prompt regression tests — pytest suite: grounding test, context-use test, graceful-decline test
  • Cost logger — log every LLM call to costs.db with model, endpoint, user, tokens, cost
  • Cost report API — GET /admin/costs returns 30-day report: total, by model, top users, cache savings
  • Response cache — Redis-backed exact match for temperature=0 calls, 1-hour TTL
  • Prompt caching — apply cache_control to your RAG system prompt; log cache_read vs cache_write tokens

Skills: SQLite versioning, pytest fixtures, Redis caching, cost analytics, Anthropic prompt caching
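
A minimal shape for the registry endpoints. A sketch, assuming FastAPI plus the registry functions from this module; route paths and payloads are suggestions, not a spec:

# Hypothetical FastAPI wiring (sketch)
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from prompt_registry import register_prompt, activate_prompt, list_prompt_history  # assumed module

app = FastAPI()

class PromptIn(BaseModel):
    name: str
    content: str
    author: str = ""
    notes: str = ""

@app.post("/admin/prompts")
def create_version(p: PromptIn):
    version = register_prompt(p.name, p.content, p.author, p.notes)
    return {"name": p.name, "version": version}

@app.post("/admin/prompts/{name}/activate/{version}")
def activate(name: str, version: int):
    activate_prompt(name, version)
    return {"name": name, "active_version": version}

@app.get("/admin/prompts/{name}/history")
def history(name: str):
    rows = list_prompt_history(name)
    if not rows:
        raise HTTPException(404, f"no prompt named {name!r}")
    return rows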

LAB 1

Prompt Versioning Lifecycle

Objective: Practise the full register → test → activate → monitor → rollback cycle.

1
Register your current RAG system prompt as v1. Activate it. Make a deliberate quality-degrading change (remove the "only answer from context" rule). Register as v2.
2
Run your regression test suite on v2. Verify the grounding test fails (as expected — the change broke it).
3
Fix the prompt. Register v3. Verify all tests pass on v3. Activate v3.
4
Verify prompt_changed() returns False (DB matches file). Call list_prompt_history() and verify v1, v2, v3 are all recorded with their authors and timestamps.
LAB 2

Cost Report — Find Your Biggest Spend

Objective: Instrument 100 real API calls and use the cost report to identify optimisation opportunities.

1
Add log_llm_call() to every LLM call in your M23 API. Run 100 test requests across all endpoints. Generate cost_report(days=1).
2
Answer from the report: Which model costs the most? Which endpoint uses the most tokens? Which user has the highest spend?
3
Identify 2 endpoints where you can switch to Haiku instead of Sonnet. Make the switch. Run another 100 requests. Compare the cost reports before and after. What is the % cost reduction?
4
Add the response cache. Run the same 100 requests again. How many were served from cache? What is cache_savings in the report? What is the effective cost reduction including caching?
LAB 3

Prompt Caching — Measure the Savings

Objective: Add Anthropic prompt caching and measure the real cost reduction.

1
Take your RAG system prompt (make it long — add extensive instructions until it exceeds 1024 tokens). Log the cache_creation_input_tokens on the first call and cache_read_input_tokens on subsequent calls.
2
Run 20 queries in rapid succession (within 5 min). For each, print: cache_write, cache_read, total cost. Verify calls 2-20 show cache_read_input_tokens instead of cache_creation_input_tokens.
3
Calculate: cost without caching (20 × full system prompt cost) vs cost with caching (1 write + 19 reads). What is the % savings?
4
Wait 6 minutes (beyond the 5-min cache window). Send another request. Verify cache_creation_input_tokens is non-zero again (cache expired). Confirm caching is re-triggered.

P7-M26 MASTERY CHECKLIST

When complete: Move to P7-M27 — MLOps Foundations. The final Part 7 module covers CI/CD for AI, model versioning, and the operational patterns needed for long-running AI products.
