The ML System Stack
Offline pipeline builds the model — online pipeline serves it
// OFFLINE PIPELINE (batch)
↓
Raw Data Sources
data warehouse, event logs, user actions
↓
Feature Engineering
Spark/Flink jobs computing features from raw data
↓
Feature Store (offline)
Hive / BigQuery / S3 — full history, training data
↓
Training Pipeline
data validation → train → eval → register
↓
Model Registry
versioned artifacts, metrics, approval gate
↓
Deploy to Serving Layer
canary → shadow → full rollout
// ONLINE PIPELINE (real-time)
↓
User Request
API call → recommendation service
↓
Feature Retrieval
Feature Store (online) — Redis / Bigtable, <10ms
↓
Candidate Retrieval
ANN search — top-500 from 10M items, ~15ms
↓
Model Inference
Ranking model scores each candidate, ~40ms
↓
Post-processing
diversity, safety filters, business rules
↓
Response
top-K results returned to user
Training-serving skew — the #1 cause of silent model degradation. If the features computed for training differ even slightly from features computed during serving (different code paths, different data sources, different timestamp handling), the model will perform worse in production than it did offline. The feature store's job is to ensure identical computation for both paths.
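The fix is structural: define each feature's computation once and have both paths call it. A minimal sketch (illustrative names, not a specific feature-store library):

# A minimal sketch: the feature's logic lives in exactly one place,
# and both the offline and online pipelines call the same object.
from dataclasses import dataclass
from typing import Callable

@dataclass
class FeatureDef:
    name: str
    fn: Callable  # the ONLY implementation of this feature's logic

def avg_session_duration_30d(sessions: list[float]) -> float:
    return sum(sessions) / max(len(sessions), 1)

AVG_SESSION_30D = FeatureDef("avg_session_30d", avg_session_duration_30d)

# The batch backfill job and the online request path import the same object:
training_value = AVG_SESSION_30D.fn([120.0, 300.0, 45.0])  # offline pipeline
serving_value = AVG_SESSION_30D.fn([120.0, 300.0, 45.0])   # online pipeline
assert training_value == serving_value  # identical code path, no skew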
Feature Stores
Centralized feature repository — eliminates skew, enables sharing and versioning
REAL-TIME
< 1 second
Computed from streaming events. Kafka → Flink → Redis. Reflects user's most recent actions within the current session.
→ last 5 clicks (60s window)
→ current session duration
→ live cart contents
Store: Redis (TTL 1hr)
NEAR-REAL-TIME
1–60 min
Micro-batch jobs (Spark Structured Streaming). Aggregate windows computed every few minutes. Balances freshness and compute cost.
→ CTR in past hour
→ purchase intent score (1h)
→ trending items right now
Store: Redis / DynamoDB
BATCH
Hours / Days
Daily or hourly Spark / Hive jobs. Historical aggregates over large windows. Cheap to compute at scale, but stale by hours.
→ avg session duration (30d)
→ lifetime purchase value
→ user age / demographics
Store: BigQuery / Cassandra
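As a concrete example of the real-time tier, here is a minimal sketch using redis-py: a 60-second sliding window of clicks with the 1-hour TTL from the table. Key names are illustrative; in production this logic would run inside a Kafka/Flink consumer.

import time
import redis

r = redis.Redis()

def on_click(user_id: str, item_id: str) -> None:
    key = f"rt:last_clicks:{user_id}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zadd(key, {item_id: now})            # record event with timestamp
    pipe.zremrangebyscore(key, 0, now - 60)   # keep only the 60s window
    pipe.expire(key, 3600)                    # TTL 1hr, as in the table above
    pipe.execute()

def last_clicks(user_id: str) -> list[bytes]:
    # the "last 5 clicks (60s window)" feature, most recent first
    return r.zrevrange(f"rt:last_clicks:{user_id}", 0, 4)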
Point-in-time correctness — preventing data leakage in training · CRITICAL
// Training on historical data: label occurred at time T
// Must retrieve feature values AS OF time T, not current values
// Using future features to predict the past = data leakage = inflated offline metrics

// WRONG — uses today's feature value to predict a past label:
features = feature_store.get(entity_id=user_123, feature="avg_session_30d")
// This returns the feature as it is TODAY — but the label was from 6 months ago.
// The feature includes 6 months of future data. Model appears great offline, fails in prod.

// CORRECT — point-in-time correct retrieval:
features = feature_store.get(
    entity_id=user_123,
    feature="avg_session_30d",
    as_of=label_timestamp    ← retrieve value AS OF when the label occurred
)

// Feature store must store full history of feature values with timestamps
// Offline store schema:
//   (entity_id, feature_name, value, event_timestamp, created_timestamp)
// Query: SELECT value WHERE entity_id=X AND feature_name=Y
//        AND event_timestamp <= label_timestamp ORDER BY event_timestamp DESC LIMIT 1
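For batch training data, point-in-time joins are often done with an as-of merge. A runnable sketch with pandas.merge_asof (column names illustrative): each label row gets the latest feature value at or before its timestamp, never a later one.

import pandas as pd

labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "label_ts": pd.to_datetime(["2024-01-10", "2024-03-01", "2024-02-15"]),
    "label": [1, 0, 1],
}).sort_values("label_ts")

feature_history = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2024-01-01", "2024-02-20", "2024-02-01"]),
    "avg_session_30d": [310.0, 290.0, 120.0],
}).sort_values("event_ts")

train = pd.merge_asof(
    labels, feature_history,
    left_on="label_ts", right_on="event_ts",
    by="user_id", direction="backward",  # only past values: no leakage
)
print(train[["user_id", "label_ts", "avg_session_30d", "label"]])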
Training Pipeline
Six stages from raw data to a registered model artifact
Production training pipeline — all six stages · PIPELINE
// Stage 1: DATA INGESTION
// Pull labeled examples from data warehouse, join with feature store
data = bigquery.query("SELECT user_id, item_id, label, event_time FROM interactions WHERE date=?")
features = feature_store.batch_get(entity_ids=data.user_ids, as_of=data.event_times)
dataset = join(data, features)

// Stage 2: DATA VALIDATION
// Catch data quality issues before training (fail fast)
validator.check_schema(dataset)          // expected columns present?
validator.check_distributions(dataset)   // feature values in expected range?
validator.check_null_rates(dataset)      // null rate < threshold per feature?
validator.check_label_balance(dataset)   // not extreme class imbalance?

// Stage 3: FEATURE PREPROCESSING (saved as artifact → identical at serving time)
preprocessor = Pipeline([
    StandardScaler(cols=["age", "session_duration"]),
    OneHotEncoder(cols=["device_type", "country"]),
    MeanImputer(cols=["avg_purchase_value"])
])
preprocessor.fit_transform(train_data)   // saved as artifact, applied at serving

// Stage 4: MODEL TRAINING
model = TwoTowerModel(user_dim=256, item_dim=256)
trainer = DistributedTrainer(model, gpus=8)   // Horovod / PyTorch DDP
trainer.train(train_data, epochs=10)

// Stage 5: EVALUATION (must beat champion model)
metrics = evaluator.eval(model, test_data)
if metrics.recall_at_100 <= champion.recall_at_100:
    pipeline.fail("New model does not beat champion. Stopping.")

// Stage 6: MODEL REGISTRATION
registry.register(model, {
    "recall@100": metrics.recall_at_100,
    "dataset_version": dataset.version,
    "training_code_hash": git.sha(),
    "status": "candidate"   // awaiting A/B test
})
Retraining triggers — three strategies (see the sketch below):
(1) Scheduled: retrain daily regardless. Simple, catches gradual drift, may miss sudden shifts.
(2) Drift-triggered: monitor PSI on feature distributions, retrain when PSI > 0.2. More responsive, requires monitoring infra.
(3) Performance-triggered: online business metrics (CTR, conversion) drop > X% — immediate retrain. Most responsive, but online metrics lag by hours.
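A minimal sketch combining the three triggers into one decision function; the thresholds (24h schedule, PSI > 0.2, 10% metric drop) follow the text above, and the inputs are illustrative stand-ins for real monitoring queries.

def should_retrain(hours_since_last_train: float,
                   feature_psi: dict[str, float],
                   ctr_now: float, ctr_baseline: float) -> str | None:
    if hours_since_last_train >= 24:
        return "scheduled"                          # strategy (1)
    if any(v > 0.2 for v in feature_psi.values()):
        return "drift"                              # strategy (2)
    if ctr_now < ctr_baseline * 0.90:
        return "performance"                        # strategy (3)
    return None

reason = should_retrain(6.0, {"avg_session_30d": 0.27}, 0.021, 0.020)
if reason:
    print(f"trigger retraining pipeline: {reason}")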
Model Serving
Batch inference vs real-time inference — and how to deploy safely
Batch Inference
PRE-COMPUTED SCORES · DB LOOKUP AT SERVING
Run model on large dataset of entities offline. Store results. Serving time = just a DB lookup. No model in hot path.
Latency: ~0ms (DB lookup)
Freshness: stale by hours/days
Cost: cheap at serving time
Good for: email recs, non-urgent personalization
Example: pre-compute top-1000 videos
per user nightly → Redis hash
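A sketch of the batch-inference pattern, assuming redis-py; model.predict and the data shapes are illustrative. The nightly job pays the compute cost, so serving is a pure hash lookup.

import json
import redis

r = redis.Redis()

def nightly_batch_job(model, users, items):
    # Offline: score every (user, item) pair and persist top-1000 per user
    for user in users:
        scores = [(item, model.predict(user, item)) for item in items]
        top = sorted(scores, key=lambda s: s[1], reverse=True)[:1000]
        r.hset("batch_recs", user, json.dumps([item for item, _ in top]))

def serve(user_id: str) -> list[str]:
    raw = r.hget("batch_recs", user_id)   # ~0ms path: no model in the loop
    return json.loads(raw) if raw else []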
Real-Time Inference
LIVE MODEL · INFERENCE ON EVERY REQUEST
Model loaded in serving process. Inference runs on each request. Uses real-time features for freshest predictions.
Latency: 10–100ms (model size-dependent)
Freshness: real-time features
Cost: GPU/CPU for every request
Good for: search ranking, ads, fraud detection
Server: TensorFlow Serving, Triton,
TorchServe — with request batching
Model deployment strategies — safe rollout without downtime · DEPLOYMENT
// 1. SHADOW MODE — new model runs alongside old, results NOT shown to users
// Log new model's predictions for offline analysis. Zero user impact.
user_response = champion_model.predict(features)     ← shown to user
shadow_result = challenger_model.predict(features)   ← logged only
logger.log({"shadow_prediction": shadow_result, "actual_label": label})

// 2. CANARY — small % of real traffic to new model
if hash(user_id) % 100 < 5:              ← 5% canary
    return challenger_model.predict(features)
else:
    return champion_model.predict(features)

// 3. BLUE-GREEN — both versions hot, instant traffic switch
// Keep v1 (blue) running. Deploy v2 (green). Validate green.
// Switch load balancer: 100% → green. Keep blue for 1hr (fast rollback).
load_balancer.set_backend("green")       ← instant switch

// Automated rollback trigger:
if metrics.p99_latency > champion.p99_latency * 1.1:   ← 10% regression
    load_balancer.set_backend("blue")                  ← instant rollback
    alerting.page("Challenger model rolled back: latency regression")
Two-Tower Model
The dominant architecture for large-scale recommendation retrieval
// TWO-TOWER ARCHITECTURE — rank 10M items in <100ms
User Tower
Input features:
→ watch history (ids)
→ search queries
→ demographics
→ real-time context
Output:
256-dim user embedding
computed at query time
⊙
dot product
Item Tower
Input features:
→ video metadata
→ transcript embedding
→ view/like ratios
→ category / tags
Output:
256-dim item embedding
pre-computed offline
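A minimal two-tower sketch in PyTorch: small ID vocabularies stand in for the richer feature sets listed above, and training uses in-batch negatives (the matching item for each user sits on the diagonal of the similarity matrix). All sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.mlp(self.emb(ids)), dim=-1)  # unit-norm embedding

user_tower = Tower(vocab_size=100_000)   # toy vocab; millions of users in prod
item_tower = Tower(vocab_size=100_000)   # toy vocab; 10M items in prod

# A batch of (user, watched-item) positive pairs:
user_ids = torch.randint(0, 100_000, (32,))
item_ids = torch.randint(0, 100_000, (32,))

u = user_tower(user_ids)   # (32, 256), computed at query time when serving
v = item_tower(item_ids)   # (32, 256), pre-computed offline when serving

logits = u @ v.T                                   # dot-product similarity
loss = F.cross_entropy(logits, torch.arange(32))   # in-batch negatives
loss.backward()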
Two-stage pipeline — retrieval then ranking · SERVING FLOW
// STAGE 1: RETRIEVAL — find top-500 from 10M items in ~25ms
user_emb = user_tower.embed(user_features)       // 256-dim vector, ~5ms
candidates = ann_index.search(user_emb, k=500)   // ScaNN ANN search, ~15ms
// ANN index: 10M items × 256 dims × 4 bytes = ~10 GB → fits in RAM
// ScaNN achieves ~95% recall@100 at 10ms for 10M items

// STAGE 2: RANKING — score each of 500 candidates precisely
for item_id in candidates:
    item_features = feature_store.get(item_id)   // Redis batch fetch
    cross_features = compute_cross(user, item)   // interaction features
    score = ranking_model.predict(user_features, item_features, cross_features)
ranked = sorted(candidates, key=score, reverse=True)

// STAGE 3: POST-PROCESSING — business rules on top-100
final = post_processor.apply(ranked[:100], rules=[
    FilterWatched(user_id),              // don't show already-watched
    EnforceDiversity(max_per_topic=3),   // not 10 Taylor Swift videos
    FreshnessBoost(hours=24),            // boost content <24h old
    SafetyFilter()                       // remove policy violations
])
return final[:20]   // top 20 to user
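The retrieval stage in runnable form, using FAISS as a stand-in for ScaNN (both are ANN libraries; the flow is the same). 100k random vectors substitute for the 10M item embeddings.

import numpy as np
import faiss

d, n_items = 256, 100_000
item_embs = np.random.randn(n_items, d).astype("float32")
faiss.normalize_L2(item_embs)                        # inner product = cosine

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(item_embs)                               # learn coarse clusters
index.add(item_embs)
index.nprobe = 32                                    # clusters probed per query

user_emb = np.random.randn(1, d).astype("float32")
faiss.normalize_L2(user_emb)
scores, candidate_ids = index.search(user_emb, 500)  # top-500 candidates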
A/B Testing at Scale
The only way to measure causal impact — offline metrics are necessary but not sufficient
A/B test setup — the decisions that determine validity · DESIGN
// Split by USER (not by request) — same user always sees same model
// Splitting by request = same user gets both models = contamination
def get_model(user_id):
    bucket = hash(user_id) % 100
    if bucket < 5:              ← 5% treatment
        return challenger_model
    else:                       ← 95% control
        return champion_model

// Sample size calculation:
// baseline CTR = 2%, minimum detectable effect = 0.2% (10% relative lift)
// power = 80%, significance = 0.05
n_per_group = sample_size(
    baseline=0.02, mde=0.002, power=0.8, alpha=0.05
) → ~80,000 users per group → ~160K total

// Primary metric: business metric (CTR, watch time, conversion rate)
// NOT offline AUC — model can improve AUC while hurting business metrics
// Guardrail metrics: must not regress (latency p99, revenue, crash rate)

// Peeking problem: DO NOT stop early because p < 0.05 after 3 days
// Each time you check, you increase the false positive rate.
// Fix: pre-commit to run duration (2 weeks); use sequential testing if early stopping is needed
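The sample-size step in runnable form with statsmodels; with these inputs the standard two-proportion power calculation lands around 80k users per group (exact figures vary slightly by formula variant).

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.022, 0.020)   # Cohen's h for 2.2% vs 2.0% CTR
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, power=0.8, alpha=0.05, alternative="two-sided"
)
print(round(n_per_group))   # ≈ 80,600 per group with this formula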
Peeking
FALSE POSITIVES
Stopping when p<0.05 after 3 days inflates false positive rate. Each check is another chance to see a "significant" result by chance.
Fix: pre-commit to duration.
Use sequential testing (always-valid p-values) if must check early.
Novelty Effect
INFLATED SHORT-TERM
Users click new UI just because it's new. Effect fades after 1–2 weeks. Short tests overestimate long-term impact.
Fix: run for minimum 2 weeks.
Check if effect is stable in week 2 vs week 1.
Network Effects
INTERFERENCE
On social networks, user A (control) interacts with user B (treatment). A is contaminated by B's treatment. Standard splits invalid.
Fix: cluster-based splitting.
Split by social clusters, not individual users.
Feature & Concept Drift
Models degrade silently — active monitoring is mandatory in production
Data Drift
COVARIATE SHIFT
Input feature distribution changes. Training users were 18–35, now 13–60. Model was never trained on this input range.
Detection: PSI, KL divergence
on feature distributions daily.
PSI > 0.2 = significant drift → retrain.
Concept Drift
LABEL RELATIONSHIP CHANGES
Feature → label relationship changes. "Free shipping" used to predict high intent; now it's table stakes and non-predictive.
Detection: online metrics (CTR,
conversion) vs offline eval baseline.
Drop > threshold → triggered retrain.
Label Drift
LABEL DISTRIBUTION SHIFTS
Distribution of labels changes. Fraud rate increases from 0.1% to 0.3% due to a new attack vector. Model calibration is off.
Detection: monitor label frequency
in production feedback pipeline.
Retrain immediately on label drift.
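A sketch of a label-drift check: compare the production label rate against the training baseline with a two-proportion z-test from statsmodels. The counts mirror the fraud example above (0.1% → 0.3%); in practice you would also alert on the raw rate itself.

from statsmodels.stats.proportion import proportions_ztest

fraud_train, n_train = 1_000, 1_000_000   # 0.1% at training time
fraud_prod, n_prod = 3_000, 1_000_000     # 0.3% in production this week

stat, p_value = proportions_ztest(
    count=[fraud_train, fraud_prod], nobs=[n_train, n_prod]
)
if p_value < 0.01:
    print("label drift detected → trigger immediate retrain")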
PSI — Population Stability Index — the standard drift metric · FORMULA
// PSI = Population Stability Index
// Compares distribution of a feature between training (baseline) and production (current)
// Higher PSI = more drift

// Formula:
//   PSI = Σ (P_current_i - P_baseline_i) × ln(P_current_i / P_baseline_i)
//   Where P_i = proportion of observations in bucket i (e.g., 10 equal-frequency buckets)

// Thresholds (industry standard):
PSI < 0.1     → No significant change, model stable
PSI 0.1–0.2   → Minor shift, investigate further
PSI > 0.2     → Significant shift, retrain model

// Monitoring stack (6 daily metrics to track):
1. PSI per feature         ← data drift
2. Prediction score dist   ← model output drift
3. Null / missing rate     ← pipeline health
4. Business metric (CTR)   ← concept drift proxy
5. Label distribution      ← label drift
6. Inference latency p99   ← serving health
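The PSI formula in runnable form (numpy), using 10 equal-frequency buckets fitted on the baseline and a small epsilon so empty buckets don't blow up the log:

import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, buckets: int = 10) -> float:
    # bucket edges from the baseline's quantiles (equal-frequency buckets)
    edges = np.quantile(baseline, np.linspace(0, 1, buckets + 1))
    # clip so out-of-range production values fall into the edge buckets
    p_base = np.histogram(np.clip(baseline, edges[0], edges[-1]), edges)[0] / len(baseline)
    p_curr = np.histogram(np.clip(current, edges[0], edges[-1]), edges)[0] / len(current)
    eps = 1e-6
    p_base, p_curr = p_base + eps, p_curr + eps
    return float(np.sum((p_curr - p_base) * np.log(p_curr / p_base)))

rng = np.random.default_rng(0)
train_ages = rng.normal(27, 5, 100_000)    # trained mostly on younger users
prod_ages = rng.normal(38, 12, 100_000)    # production skews older
print(psi(train_ages, prod_ages))          # well above 0.2 → retrain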
YouTube Recommendations — Full ML System
Connecting all concepts: 2B users, 800M DAU, 100ms budget
// ONLINE LATENCY BUDGET — 100ms total
Feature retrieval             10ms
User embedding compute         5ms
ANN retrieval (10M items)     15ms
Ranking (500 candidates)      40ms
Post-processing                5ms
Network + overhead            25ms
Offline pipeline — daily batch training cycle · OFFLINE
// FEATURE COMPUTATION (daily Spark jobs):
user_features = compute(watch_history_30d, search_history_7d, demographics)
video_features = compute(view_count, like_ratio, avg_watch_pct, transcript_emb)
// → Written to BigQuery (offline) + Bigtable (online, low-latency lookup)

// TRAINING DATA GENERATION:
positives = events WHERE watch_pct > 0.5          // user watched >50% of video
negatives = sample(shown_but_not_clicked, n=10)   // 10 negatives per positive
dataset = join_pit_correct(positives + negatives, feature_store)

// TWO-TOWER RETRIEVAL TRAINING:
// Goal: Recall@100 (are ground-truth videos in top-100 from 10M?)
// Pre-index: all 10M video embeddings → ScaNN index (~10 GB in RAM)

// RANKING MODEL TRAINING:
// Input: (user_emb, video_emb, cross_features, context)
// Output: predicted watch time (regression)
// Eval: NDCG@20 on held-out test set

// A/B TEST PROMOTION:
// New model → 5% canary → primary metric: watch time per session
// Guardrail: p99 latency must not exceed 110ms
// Run 2 weeks → evaluate → ramp 5% → 10% → 50% → 100%
Task 1 · Feature Store for Fraud Detection
- List 10 features across user, transaction, and merchant dimensions needed for fraud detection.
- Classify each as batch (daily), near-real-time (5-min), or real-time (<1s). Justify each classification.
- Design the online store schema. What is the key structure (entity_id + feature_name)? What TTL for each tier?
- A transaction arrives. Walk through the full feature retrieval path end-to-end, including which store each feature comes from and estimated latency per call.
- Training time: you have historical transactions from 6 months ago. How do you retrieve the user's "transactions in past hour" feature as it was at the time of each historical transaction? What schema makes this possible?
Task 2 · A/B Test Design for New Ranking Model
- What is your primary metric? Why not use offline AUC or NDCG as the decision metric?
- Baseline CTR = 2%, minimum detectable effect = 0.2% (absolute), power = 80%, α = 0.05. How many users do you need per group?
- Should you split by user, by session, or by request? What goes wrong with each wrong choice?
- Your test shows p = 0.03 after 3 days. Should you ship? What are the two reasons not to?
- After 2 weeks: treatment shows +3% CTR but −1% average session duration. What do you decide, and what does this tell you about the model?
Task 3 · Drift Monitoring System
- Your fraud model was trained 6 months ago. Which 3 features are most likely to have drifted? Why?
- Write the PSI formula. What does each bucket represent? What threshold triggers retraining?
- You detect label drift: fraud rate increased from 0.1% to 0.3%. Is this data drift or concept drift? What's your immediate response?
- PSI is high on the "transaction_amount" feature. Before triggering a retrain, how do you determine if this is a real distribution shift vs a data pipeline bug (upstream schema change, null injection)?
- Design the full monitoring dashboard: 6 metrics, alert thresholds, and escalation policy.
Task 4 (capstone) · Design a News Feed Ranking System
Design the complete ML system for ranking a social media news feed (LinkedIn or Facebook scale).
- Objective function: engagement maximization leads to clickbait. How do you define an objective that balances engagement, time-well-spent, and content quality?
- Two-stage pipeline: retrieval model (what towers?) + ranking model (what input features?).
- Feature store: define 10 key features with their freshness tier and storage backend.
- Cold-start problem: new user has no history. New post was created 5 minutes ago. How does your system handle both?
- A/B test: primary metric, guardrail metrics, duration, and what split level (user/social cluster)?
- You ship the model, and after 2 weeks users complain the feed feels addictive and manipulative. The model learned to maximize clicks by surfacing outrage. How do you fix this at the ML system level?
MODULE C3 · ML SYSTEMS
ML stack: offline pipeline vs online pipeline, training-serving skew
Feature store: offline (training) vs online (serving), point-in-time correctness
Feature freshness tiers: real-time (<1s Redis), near-RT (minutes), batch (hours)
Point-in-time correctness: as_of=timestamp prevents data leakage
Training pipeline: 6 stages from ingestion to model registration
Retraining triggers: scheduled, drift-triggered, performance-triggered
Batch vs real-time inference: latency vs freshness trade-off
Deployment: shadow mode → canary → blue-green with automated rollback
Two-tower model: user tower + item tower, dot product similarity
Two-stage pipeline: ANN retrieval (500 candidates) → ranking → post-processing
ANN search: FAISS/ScaNN — ~15ms for 10M items in RAM
A/B testing: split by user, p-value, power, sample size calculation
A/B pitfalls: peeking, novelty effect, network effects → cluster splitting
Primary metric = business metric, not offline AUC
Data drift vs concept drift vs label drift — definitions and detection
PSI formula, thresholds: <0.1 stable, 0.1-0.2 monitor, >0.2 retrain
6 monitoring metrics: PSI, prediction dist, null rate, CTR, label dist, latency
YouTube recs: offline pipeline + 100ms online budget breakdown
✏️ Task 1: feature store for fraud detection
✏️ Task 2: A/B test design with sample size calculation
✏️ Task 3: drift monitoring system with PSI
✏️ Task 4 (capstone): news feed ranking ML system
// NEXT MODULE
C4 — Observability & SRE
Metrics, logs, traces — the three pillars · SLO/SLI/SLA
Distributed tracing (Jaeger, Zipkin) · Alerting design
Incident response · On-call · Error budgets · Chaos engineering