The ML System Stack
Offline pipeline builds the model — online pipeline serves it
// OFFLINE PIPELINE (batch)
↓
Raw Data Sources
data warehouse, event logs, user actions
↓
Feature Engineering
Spark/Flink jobs computing features from raw data
↓
Feature Store (offline)
Hive / BigQuery / S3 — full history, training data
↓
Training Pipeline
data validation → train → eval → register
↓
Model Registry
versioned artifacts, metrics, approval gate
↓
Deploy to Serving Layer
canary → shadow → full rollout
// ONLINE PIPELINE (real-time)
↓
User Request
API call → recommendation service
↓
Feature Retrieval
Feature Store (online) — Redis / Bigtable, <10ms
↓
Candidate Retrieval
ANN search — top-500 from 10M items, ~15ms
↓
Model Inference
Ranking model scores each candidate, ~40ms
↓
Post-processing
diversity, safety filters, business rules
↓
Response
top-K results returned to user
Training-serving skew — the #1 cause of silent model degradation. If the features computed for training differ even slightly from features computed during serving (different code paths, different data sources, different timestamp handling), the model will perform worse in production than it did offline. The feature store's job is to ensure identical computation for both paths.
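The fix is structural: define each feature's computation once and have both paths call it. A minimal sketch (illustrative names, not a specific feature-store library):

# A minimal sketch: the feature's logic lives in exactly one place,
# and both the offline and online pipelines call the same object.
from dataclasses import dataclass
from typing import Callable

@dataclass
class FeatureDef:
    name: str
    fn: Callable  # the ONLY implementation of this feature's logic

def avg_session_duration_30d(sessions: list[float]) -> float:
    return sum(sessions) / max(len(sessions), 1)

AVG_SESSION_30D = FeatureDef("avg_session_30d", avg_session_duration_30d)

# The batch backfill job and the online request path import the same object:
training_value = AVG_SESSION_30D.fn([120.0, 300.0, 45.0])  # offline pipeline
serving_value = AVG_SESSION_30D.fn([120.0, 300.0, 45.0])   # online pipeline
assert training_value == serving_value  # identical code path, no skew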
Feature Stores
Centralized feature repository — eliminates skew, enables sharing and versioning
REAL-TIME
< 1 second
Computed from streaming events. Kafka → Flink → Redis. Reflects user's most recent actions within the current session.
→ last 5 clicks (60s window)
→ current session duration
→ live cart contents
Store: Redis (TTL 1hr)
NEAR-REAL-TIME
1–60 min
Micro-batch jobs (Spark Structured Streaming). Aggregate windows computed every few minutes. Balances freshness and compute cost.
→ CTR in past hour
→ purchase intent score (1h)
→ trending items right now
Store: Redis / DynamoDB
BATCH
Hours / Days
Daily or hourly Spark / Hive jobs. Historical aggregates over large windows. Cheap to compute at scale, but stale by hours.
→ avg session duration (30d)
→ lifetime purchase value
→ user age / demographics
Store: BigQuery / Cassandra
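As a concrete example of the real-time tier, here is a minimal sketch using redis-py: a 60-second sliding window of clicks with the 1-hour TTL from the table. Key names are illustrative; in production this logic would run inside a Kafka/Flink consumer.

import time
import redis

r = redis.Redis()

def on_click(user_id: str, item_id: str) -> None:
    key = f"rt:last_clicks:{user_id}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zadd(key, {item_id: now})            # record event with timestamp
    pipe.zremrangebyscore(key, 0, now - 60)   # keep only the 60s window
    pipe.expire(key, 3600)                    # TTL 1hr, as in the table above
    pipe.execute()

def last_clicks(user_id: str) -> list[bytes]:
    # the "last 5 clicks (60s window)" feature, most recent first
    return r.zrevrange(f"rt:last_clicks:{user_id}", 0, 4)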
Point-in-time correctness — preventing data leakage in training · CRITICAL
// Training on historical data: label occurred at time T
// Must retrieve feature values AS OF time T, not current values
// Using future features to predict the past = data leakage = inflated offline metrics

// WRONG — uses today's feature value to predict a past label:
features = feature_store.get(entity_id=user_123, feature="avg_session_30d")
// This returns the feature as it is TODAY — but the label was from 6 months ago.
// The feature includes 6 months of future data. Model appears great offline, fails in prod.

// CORRECT — point-in-time correct retrieval:
features = feature_store.get(
    entity_id=user_123,
    feature="avg_session_30d",
    as_of=label_timestamp    ← retrieve value AS OF when the label occurred
)

// Feature store must store full history of feature values with timestamps
// Offline store schema:
//   (entity_id, feature_name, value, event_timestamp, created_timestamp)
// Query: SELECT value WHERE entity_id=X AND feature_name=Y
//        AND event_timestamp <= label_timestamp ORDER BY event_timestamp DESC LIMIT 1
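For batch training data, point-in-time joins are often done with an as-of merge. A runnable sketch with pandas.merge_asof (column names illustrative): each label row gets the latest feature value at or before its timestamp, never a later one.

import pandas as pd

labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "label_ts": pd.to_datetime(["2024-01-10", "2024-03-01", "2024-02-15"]),
    "label": [1, 0, 1],
}).sort_values("label_ts")

feature_history = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2024-01-01", "2024-02-20", "2024-02-01"]),
    "avg_session_30d": [310.0, 290.0, 120.0],
}).sort_values("event_ts")

train = pd.merge_asof(
    labels, feature_history,
    left_on="label_ts", right_on="event_ts",
    by="user_id", direction="backward",  # only past values: no leakage
)
print(train[["user_id", "label_ts", "avg_session_30d", "label"]])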
Training Pipeline
Six stages from raw data to a registered model artifact
Production training pipeline — all six stages · PIPELINE
// Stage 1: DATA INGESTION
// Pull labeled examples from data warehouse, join with feature store
data = bigquery.query("SELECT user_id, item_id, label, event_time FROM interactions WHERE date=?")
features = feature_store.batch_get(entity_ids=data.user_ids, as_of=data.event_times)
dataset = join(data, features)

// Stage 2: DATA VALIDATION
// Catch data quality issues before training (fail fast)
validator.check_schema(dataset)          // expected columns present?
validator.check_distributions(dataset)   // feature values in expected range?
validator.check_null_rates(dataset)      // null rate < threshold per feature?
validator.check_label_balance(dataset)   // not extreme class imbalance?

// Stage 3: FEATURE PREPROCESSING (saved as artifact → identical at serving time)
preprocessor = Pipeline([
    StandardScaler(cols=["age", "session_duration"]),
    OneHotEncoder(cols=["device_type", "country"]),
    MeanImputer(cols=["avg_purchase_value"])
])
preprocessor.fit_transform(train_data)   // saved as artifact, applied at serving

// Stage 4: MODEL TRAINING
model = TwoTowerModel(user_dim=256, item_dim=256)
trainer = DistributedTrainer(model, gpus=8)   // Horovod / PyTorch DDP
trainer.train(train_data, epochs=10)

// Stage 5: EVALUATION (must beat champion model)
metrics = evaluator.eval(model, test_data)
if metrics.recall_at_100 <= champion.recall_at_100:
    pipeline.fail("New model does not beat champion. Stopping.")

// Stage 6: MODEL REGISTRATION
registry.register(model, {
    "recall@100": metrics.recall_at_100,
    "dataset_version": dataset.version,
    "training_code_hash": git.sha(),
    "status": "candidate"   // awaiting A/B test
})
Retraining triggers — three strategies (see the sketch below):
(1) Scheduled: retrain daily regardless. Simple, catches gradual drift, may miss sudden shifts.
(2) Drift-triggered: monitor PSI on feature distributions, retrain when PSI > 0.2. More responsive, requires monitoring infra.
(3) Performance-triggered: online business metrics (CTR, conversion) drop > X% — immediate retrain. Most responsive, but online metrics lag by hours.
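A minimal sketch combining the three triggers into one decision function; the thresholds (24h schedule, PSI > 0.2, 10% metric drop) follow the text above, and the inputs are illustrative stand-ins for real monitoring queries.

def should_retrain(hours_since_last_train: float,
                   feature_psi: dict[str, float],
                   ctr_now: float, ctr_baseline: float) -> str | None:
    if hours_since_last_train >= 24:
        return "scheduled"                          # strategy (1)
    if any(v > 0.2 for v in feature_psi.values()):
        return "drift"                              # strategy (2)
    if ctr_now < ctr_baseline * 0.90:
        return "performance"                        # strategy (3)
    return None

reason = should_retrain(6.0, {"avg_session_30d": 0.27}, 0.021, 0.020)
if reason:
    print(f"trigger retraining pipeline: {reason}")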
Model Serving
Batch inference vs real-time inference — and how to deploy safely
Batch Inference
PRE-COMPUTED SCORES · DB LOOKUP AT SERVING
Run model on large dataset of entities offline. Store results. Serving time = just a DB lookup. No model in hot path.
Latency: ~0ms (DB lookup)
Freshness: stale by hours/days
Cost: cheap at serving time
Good for: email recs, non-urgent personalization
Example: pre-compute top-1000 videos
per user nightly → Redis hash
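A sketch of the batch-inference pattern, assuming redis-py; model.predict and the data shapes are illustrative. The nightly job pays the compute cost, so serving is a pure hash lookup.

import json
import redis

r = redis.Redis()

def nightly_batch_job(model, users, items):
    # Offline: score every (user, item) pair and persist top-1000 per user
    for user in users:
        scores = [(item, model.predict(user, item)) for item in items]
        top = sorted(scores, key=lambda s: s[1], reverse=True)[:1000]
        r.hset("batch_recs", user, json.dumps([item for item, _ in top]))

def serve(user_id: str) -> list[str]:
    raw = r.hget("batch_recs", user_id)   # ~0ms path: no model in the loop
    return json.loads(raw) if raw else []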
Real-Time Inference
LIVE MODEL · INFERENCE ON EVERY REQUEST
Model loaded in serving process. Inference runs on each request. Uses real-time features for freshest predictions.
Latency: 10–100ms (model size-dependent)
Freshness: real-time features
Cost: GPU/CPU for every request
Good for: search ranking, ads, fraud detection
Server: TensorFlow Serving, Triton,
TorchServe — with request batching
Model deployment strategies — safe rollout without downtime · DEPLOYMENT
// 1. SHADOW MODE — new model runs alongside old, results NOT shown to users
// Log new model's predictions for offline analysis. Zero user impact.
user_response = champion_model.predict(features)     ← shown to user
shadow_result = challenger_model.predict(features)   ← logged only
logger.log({"shadow_prediction": shadow_result, "actual_label": label})

// 2. CANARY — small % of real traffic to new model
if hash(user_id) % 100 < 5:              ← 5% canary
    return challenger_model.predict(features)
else:
    return champion_model.predict(features)

// 3. BLUE-GREEN — both versions hot, instant traffic switch
// Keep v1 (blue) running. Deploy v2 (green). Validate green.
// Switch load balancer: 100% → green. Keep blue for 1hr (fast rollback).
load_balancer.set_backend("green")       ← instant switch

// Automated rollback trigger:
if metrics.p99_latency > champion.p99_latency * 1.1:   ← 10% regression
    load_balancer.set_backend("blue")                  ← instant rollback
    alerting.page("Challenger model rolled back: latency regression")
Two-Tower Model
The dominant architecture for large-scale recommendation retrieval
// TWO-TOWER ARCHITECTURE — rank 10M items in <100ms
User Tower
Input features:
→ watch history (ids)
→ search queries
→ demographics
→ real-time context
Output:
256-dim user embedding
computed at query time
⊙
dot product
Item Tower
Input features:
→ video metadata
→ transcript embedding
→ view/like ratios
→ category / tags
Output:
256-dim item embedding
pre-computed offline
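A minimal two-tower sketch in PyTorch: small ID vocabularies stand in for the richer feature sets listed above, and training uses in-batch negatives (the matching item for each user sits on the diagonal of the similarity matrix). All sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.mlp(self.emb(ids)), dim=-1)  # unit-norm embedding

user_tower = Tower(vocab_size=100_000)   # toy vocab; millions of users in prod
item_tower = Tower(vocab_size=100_000)   # toy vocab; 10M items in prod

# A batch of (user, watched-item) positive pairs:
user_ids = torch.randint(0, 100_000, (32,))
item_ids = torch.randint(0, 100_000, (32,))

u = user_tower(user_ids)   # (32, 256), computed at query time when serving
v = item_tower(item_ids)   # (32, 256), pre-computed offline when serving

logits = u @ v.T                                   # dot-product similarity
loss = F.cross_entropy(logits, torch.arange(32))   # in-batch negatives
loss.backward()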
Two-stage pipeline — retrieval then ranking · SERVING FLOW
// STAGE 1: RETRIEVAL — find top-500 from 10M items in ~25ms
user_emb = user_tower.embed(user_features)       // 256-dim vector, ~5ms
candidates = ann_index.search(user_emb, k=500)   // ScaNN ANN search, ~15ms
// ANN index: 10M items × 256 dims × 4 bytes = ~10 GB → fits in RAM
// ScaNN achieves ~95% recall@100 at 10ms for 10M items

// STAGE 2: RANKING — score each of 500 candidates precisely
for item_id in candidates:
    item_features = feature_store.get(item_id)   // Redis batch fetch
    cross_features = compute_cross(user, item)   // interaction features
    score = ranking_model.predict(user_features, item_features, cross_features)
ranked = sorted(candidates, key=score, reverse=True)

// STAGE 3: POST-PROCESSING — business rules on top-100
final = post_processor.apply(ranked[:100], rules=[
    FilterWatched(user_id),              // don't show already-watched
    EnforceDiversity(max_per_topic=3),   // not 10 Taylor Swift videos
    FreshnessBoost(hours=24),            // boost content <24h old
    SafetyFilter()                       // remove policy violations
])
return final[:20]   // top 20 to user
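The retrieval stage in runnable form, using FAISS as a stand-in for ScaNN (both are ANN libraries; the flow is the same). 100k random vectors substitute for the 10M item embeddings.

import numpy as np
import faiss

d, n_items = 256, 100_000
item_embs = np.random.randn(n_items, d).astype("float32")
faiss.normalize_L2(item_embs)                        # inner product = cosine

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(item_embs)                               # learn coarse clusters
index.add(item_embs)
index.nprobe = 32                                    # clusters probed per query

user_emb = np.random.randn(1, d).astype("float32")
faiss.normalize_L2(user_emb)
scores, candidate_ids = index.search(user_emb, 500)  # top-500 candidates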
A/B Testing at Scale
The only way to measure causal impact — offline metrics are necessary but not sufficient
A/B test setup — the decisions that determine validity · DESIGN
// Split by USER (not by request) — same user always sees same model
// Splitting by request = same user gets both models = contamination
def get_model(user_id):
    bucket = hash(user_id) % 100
    if bucket < 5:              ← 5% treatment
        return challenger_model
    else:                       ← 95% control
        return champion_model

// Sample size calculation:
// baseline CTR = 2%, minimum detectable effect = 0.2% (10% relative lift)
// power = 80%, significance = 0.05
n_per_group = sample_size(
    baseline=0.02, mde=0.002, power=0.8, alpha=0.05
) → ~80,000 users per group → ~160K total

// Primary metric: business metric (CTR, watch time, conversion rate)
// NOT offline AUC — model can improve AUC while hurting business metrics
// Guardrail metrics: must not regress (latency p99, revenue, crash rate)

// Peeking problem: DO NOT stop early because p < 0.05 after 3 days
// Each time you check, you increase the false positive rate.
// Fix: pre-commit to run duration (2 weeks); use sequential testing if early stopping is needed
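The sample-size step in runnable form with statsmodels; with these inputs the standard two-proportion power calculation lands around 80k users per group (exact figures vary slightly by formula variant).

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.022, 0.020)   # Cohen's h for 2.2% vs 2.0% CTR
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, power=0.8, alpha=0.05, alternative="two-sided"
)
print(round(n_per_group))   # ≈ 80,600 per group with this formula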
Peeking
FALSE POSITIVES
Stopping when p<0.05 after 3 days inflates false positive rate. Each check is another chance to see a "significant" result by chance.
Fix: pre-commit to duration.
Use sequential testing (always-valid p-values) if must check early.
Novelty Effect
INFLATED SHORT-TERM
Users click new UI just because it's new. Effect fades after 1–2 weeks. Short tests overestimate long-term impact.
Fix: run for minimum 2 weeks.
Check if effect is stable in week 2 vs week 1.
Network Effects
INTERFERENCE
On social networks, user A (control) interacts with user B (treatment). A is contaminated by B's treatment. Standard splits invalid.
Fix: cluster-based splitting.
Split by social clusters, not individual users.
Feature & Concept Drift
Models degrade silently — active monitoring is mandatory in production
Data Drift
COVARIATE SHIFT
Input feature distribution changes. Training users were 18–35, now 13–60. Model was never trained on this input range.
Detection: PSI, KL divergence
on feature distributions daily.
PSI > 0.2 = significant drift → retrain.
Concept Drift
LABEL RELATIONSHIP CHANGES
Feature → label relationship changes. "Free shipping" used to predict high intent; now it's table stakes and non-predictive.
Detection: online metrics (CTR,
conversion) vs offline eval baseline.
Drop > threshold → triggered retrain.
Label Drift
LABEL DISTRIBUTION SHIFTS
Distribution of labels changes. Fraud rate increases from 0.1% to 0.3% due to a new attack vector. Model calibration is off.
Detection: monitor label frequency
in production feedback pipeline.
Retrain immediately on label drift.
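A sketch of a label-drift check: compare the production label rate against the training baseline with a two-proportion z-test from statsmodels. The counts mirror the fraud example above (0.1% → 0.3%); in practice you would also alert on the raw rate itself.

from statsmodels.stats.proportion import proportions_ztest

fraud_train, n_train = 1_000, 1_000_000   # 0.1% at training time
fraud_prod, n_prod = 3_000, 1_000_000     # 0.3% in production this week

stat, p_value = proportions_ztest(
    count=[fraud_train, fraud_prod], nobs=[n_train, n_prod]
)
if p_value < 0.01:
    print("label drift detected → trigger immediate retrain")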
PSI — Population Stability Index — the standard drift metric · FORMULA
// PSI = Population Stability Index
// Compares distribution of a feature between training (baseline) and production (current)
// Higher PSI = more drift

// Formula:
//   PSI = Σ (P_current_i - P_baseline_i) × ln(P_current_i / P_baseline_i)
//   Where P_i = proportion of observations in bucket i (e.g., 10 equal-frequency buckets)

// Thresholds (industry standard):
PSI < 0.1     → No significant change, model stable
PSI 0.1–0.2   → Minor shift, investigate further
PSI > 0.2     → Significant shift, retrain model

// Monitoring stack (6 daily metrics to track):
1. PSI per feature         ← data drift
2. Prediction score dist   ← model output drift
3. Null / missing rate     ← pipeline health
4. Business metric (CTR)   ← concept drift proxy
5. Label distribution      ← label drift
6. Inference latency p99   ← serving health
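The PSI formula in runnable form (numpy), using 10 equal-frequency buckets fitted on the baseline and a small epsilon so empty buckets don't blow up the log:

import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, buckets: int = 10) -> float:
    # bucket edges from the baseline's quantiles (equal-frequency buckets)
    edges = np.quantile(baseline, np.linspace(0, 1, buckets + 1))
    # clip so out-of-range production values fall into the edge buckets
    p_base = np.histogram(np.clip(baseline, edges[0], edges[-1]), edges)[0] / len(baseline)
    p_curr = np.histogram(np.clip(current, edges[0], edges[-1]), edges)[0] / len(current)
    eps = 1e-6
    p_base, p_curr = p_base + eps, p_curr + eps
    return float(np.sum((p_curr - p_base) * np.log(p_curr / p_base)))

rng = np.random.default_rng(0)
train_ages = rng.normal(27, 5, 100_000)    # trained mostly on younger users
prod_ages = rng.normal(38, 12, 100_000)    # production skews older
print(psi(train_ages, prod_ages))          # well above 0.2 → retrain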
YouTube Recommendations — Full ML System
Connecting all concepts: 2B users, 800M DAU, 100ms budget
// ONLINE LATENCY BUDGET — 100ms total
Feature retrieval             10ms
User embedding compute         5ms
ANN retrieval (10M items)     15ms
Ranking (500 candidates)      40ms
Post-processing                5ms
Network + overhead            25ms
Offline pipeline — daily batch training cycle · OFFLINE
// FEATURE COMPUTATION (daily Spark jobs):
user_features = compute(watch_history_30d, search_history_7d, demographics)
video_features = compute(view_count, like_ratio, avg_watch_pct, transcript_emb)
// → Written to BigQuery (offline) + Bigtable (online, low-latency lookup)

// TRAINING DATA GENERATION:
positives = events WHERE watch_pct > 0.5          // user watched >50% of video
negatives = sample(shown_but_not_clicked, n=10)   // 10 negatives per positive
dataset = join_pit_correct(positives + negatives, feature_store)

// TWO-TOWER RETRIEVAL TRAINING:
// Goal: Recall@100 (are ground-truth videos in top-100 from 10M?)
// Pre-index: all 10M video embeddings → ScaNN index (~10 GB in RAM)

// RANKING MODEL TRAINING:
// Input: (user_emb, video_emb, cross_features, context)
// Output: predicted watch time (regression)
// Eval: NDCG@20 on held-out test set

// A/B TEST PROMOTION:
// New model → 5% canary → primary metric: watch time per session
// Guardrail: p99 latency must not exceed 110ms
// Run 2 weeks → evaluate → ramp 5% → 10% → 50% → 100%
Task 1 · Feature Store for Fraud Detection
- List 10 features across user, transaction, and merchant dimensions needed for fraud detection.
- Classify each as batch (daily), near-real-time (5-min), or real-time (<1s). Justify each classification.
- Design the online store schema. What is the key structure (entity_id + feature_name)? What TTL for each tier?
- A transaction arrives. Walk through the full feature retrieval path end-to-end, including which store each feature comes from and estimated latency per call.
- Training time: you have historical transactions from 6 months ago. How do you retrieve the user's "transactions in past hour" feature as it was at the time of each historical transaction? What schema makes this possible?
Task 2 · A/B Test Design for New Ranking Model
- What is your primary metric? Why not use offline AUC or NDCG as the decision metric?
- Baseline CTR = 2%, minimum detectable effect = 0.2% (absolute), power = 80%, α = 0.05. How many users do you need per group?
- Should you split by user, by session, or by request? What goes wrong with each wrong choice?
- Your test shows p = 0.03 after 3 days. Should you ship? What are the two reasons not to?
- After 2 weeks: treatment shows +3% CTR but −1% average session duration. What do you decide, and what does this tell you about the model?
Task 3 · Drift Monitoring System
- Your fraud model was trained 6 months ago. Which 3 features are most likely to have drifted? Why?
- Write the PSI formula. What does each bucket represent? What threshold triggers retraining?
- You detect label drift: fraud rate increased from 0.1% to 0.3%. Is this data drift or concept drift? What's your immediate response?
- PSI is high on the "transaction_amount" feature. Before triggering a retrain, how do you determine if this is a real distribution shift vs a data pipeline bug (upstream schema change, null injection)?
- Design the full monitoring dashboard: 6 metrics, alert thresholds, and escalation policy.
Task 4 (capstone) · Design a News Feed Ranking System
Design the complete ML system for ranking a social media news feed (LinkedIn or Facebook scale).
- Objective function: engagement maximization leads to clickbait. How do you define an objective that balances engagement, time-well-spent, and content quality?
- Two-stage pipeline: retrieval model (what towers?) + ranking model (what input features?).
- Feature store: define 10 key features with their freshness tier and storage backend.
- Cold-start problem: new user has no history. New post was created 5 minutes ago. How does your system handle both?
- A/B test: primary metric, guardrail metrics, duration, and what split level (user/social cluster)?
- You ship the model, and after 2 weeks users complain the feed feels addictive and manipulative. The model learned to maximize clicks by surfacing outrage. How do you fix this at the ML system level?
MODULE C3 · ML SYSTEMS
ML stack: offline pipeline vs online pipeline, training-serving skew
Feature store: offline (training) vs online (serving), point-in-time correctness
Feature freshness tiers: real-time (<1s Redis), near-RT (minutes), batch (hours)
Point-in-time correctness: as_of=timestamp prevents data leakage
Training pipeline: 6 stages from ingestion to model registration
Retraining triggers: scheduled, drift-triggered, performance-triggered
Batch vs real-time inference: latency vs freshness trade-off
Deployment: shadow mode → canary → blue-green with automated rollback
Two-tower model: user tower + item tower, dot product similarity
Two-stage pipeline: ANN retrieval (500 candidates) → ranking → post-processing
ANN search: FAISS/ScaNN — ~15ms for 10M items in RAM
A/B testing: split by user, p-value, power, sample size calculation
A/B pitfalls: peeking, novelty effect, network effects → cluster splitting
Primary metric = business metric, not offline AUC
Data drift vs concept drift vs label drift — definitions and detection
PSI formula, thresholds: <0.1 stable, 0.1-0.2 monitor, >0.2 retrain
6 monitoring metrics: PSI, prediction dist, null rate, CTR, label dist, latency
YouTube recs: offline pipeline + 100ms online budget breakdown
✏️ Task 1: feature store for fraud detection
✏️ Task 2: A/B test design with sample size calculation
✏️ Task 3: drift monitoring system with PSI
✏️ Task 4 (capstone): news feed ranking ML system
// NEXT MODULE
C4 — Observability & SRE
Metrics, logs, traces — the three pillars · SLO/SLI/SLA
Distributed tracing (Jaeger, Zipkin) · Alerting design
Incident response · On-call · Error budgets · Chaos engineering