SYSTEM DESIGN MASTERY · TRACK B · MODULE B12 · WEEK 22 INTERVIEW FRAMEWORK · MOCK INTERVIEWS · CAPSTONE
TRACK B CAPSTONE · 45-MINUTE FRAMEWORK · 6 MOCK INTERVIEWS

SYSTEM DESIGN INTERVIEW FRAMEWORK

7-STEP FRAMEWORK · TIME MANAGEMENT · CAPACITY MATH
COMMUNICATION PATTERNS · COMMON MISTAKES · MOCK DRILLS
7 FRAMEWORK STEPS
45m INTERVIEW WINDOW
6 MOCK INTERVIEWS
B12 FINAL MODULE
Contents: 7-Step Framework · Requirements · Capacity Estimation · Communication · 7 Mistakes · 6 Mock Problems · Quick Answers
The 7-Step Framework
Use this structure for every system design interview — consistently
1 · Requirements Clarification · 5 min

Never start designing before you understand the problem. Interviewers will give partial information intentionally.

→ "How many daily active users are we targeting?" → "What is the read-to-write ratio?" → "Is this globally distributed or single-region?" → "What's the acceptable latency for the critical read path?" → "Strong consistency or eventual consistency acceptable?" → "What's the data retention period?"
2 · Capacity Estimation · 5 min

Rough numbers that constrain your design choices. Do the math out loud — it shows structured thinking.

→ QPS = daily_requests / 86,400 × peak_multiplier (3×)
→ Storage = writes/day × object_size × retention_years
→ Cache ≈ 20% of daily data (80/20 rule)
→ Bandwidth = peak_QPS × avg_response_size (all four sketched below)
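A runnable sketch of these four formulas in Python. The input values below are taken from the Twitter worked example later in this module, and the 80/20 cache rule is applied to one day of new data; treat it as a back-of-envelope helper, not a sizing tool.

SECONDS_PER_DAY = 86_400

def estimate(daily_reads, daily_writes, object_bytes, response_bytes,
             retention_years=10, peak=3.0):
    read_qps = daily_reads / SECONDS_PER_DAY
    write_qps = daily_writes / SECONDS_PER_DAY
    daily_storage = daily_writes * object_bytes              # bytes/day
    return {
        "read_qps": round(read_qps),
        "peak_read_qps": round(read_qps * peak),
        "write_qps": round(write_qps),
        "storage_tb": round(daily_storage * 365 * retention_years / 1e12),
        "hot_cache_gb": round(daily_storage * 0.2 / 1e9),    # 80/20 rule
        "peak_bandwidth_mb_s": round(read_qps * peak * response_bytes / 1e6),
    }

# Twitter-scale inputs: 15B reads/day, 150M tweets/day, 550 B/tweet, 500 B responses
print(estimate(15_000_000_000, 150_000_000, 550, 500))
# {'read_qps': 173611, 'peak_read_qps': 520833, 'write_qps': 1736,
#  'storage_tb': 301, 'hot_cache_gb': 16, 'peak_bandwidth_mb_s': 260}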
3 · High-Level Design · 10 min

Draw the major components at box-and-arrow level. Cover both write path and read path. Don't over-detail yet.

→ Client → Load Balancer → Service(s) → Cache → DB
→ Identify: where does data enter, where does it get served
→ Mention async paths (queues) vs synchronous paths
4 · Data Model & API Design · 5 min

Only the tables/schemas that matter for your deep dive. Core API endpoints — method, URL, key fields.

→ Don't design ALL tables. 2–3 critical ones only.
→ Show partition key / shard key choice
→ API: GET /timeline/{userId}?cursor=X&limit=20 (handler sketched below)
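A minimal sketch of that cursor-paginated timeline endpoint, assuming Python and an in-memory stand-in for the real store (a production version would read from a cache/DB sharded by user_id):

from dataclasses import dataclass

@dataclass
class Tweet:
    id: int        # monotonically increasing, so it doubles as the cursor
    user_id: int
    text: str

TIMELINES: dict[int, list[Tweet]] = {}   # user_id -> tweets, newest first

def get_timeline(user_id: int, cursor: int | None = None, limit: int = 20) -> dict:
    tweets = TIMELINES.get(user_id, [])
    if cursor is not None:
        tweets = [t for t in tweets if t.id < cursor]   # strictly older than cursor
    page = tweets[:limit]
    next_cursor = page[-1].id if len(page) == limit else None   # None = last page
    return {"tweets": page, "next_cursor": next_cursor}

Cursor pagination (rather than offset) stays correct when new tweets are prepended between page fetches.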
5 · Deep Dive · 15 min

This is where B1–B11 knowledge pays off. Pick 2–3 hard problems in your design and go deep. Show trade-off thinking.

→ Typical: hot read path, write bottleneck, consistency challenge
→ "For the fan-out problem, I see two approaches..."
→ "The cache invalidation here is tricky because..."
6 · Bottlenecks & Scaling · 4 min

Where does your design break at 10× current scale? Address the biggest SPOFs and hot spots.

→ Single DB primary → add replicas, then shard
→ Single cache → Redis Cluster
→ Single region → multi-region with data replication
7 · Summary · 1 min

Restate the 3 key decisions you made and why. Mention what you'd do with more time. Leave a strong final impression.

→ "The three key decisions were: fan-out-on-write for the feed, Redis for the hot cache layer, and Cassandra for write-heavy storage." → "With more time, I'd explore multi-region replication."
45-Minute Time Map
The distribution that works — stick to it even under pressure
5m · REQUIREMENTS CLARIFICATION
5m · CAPACITY ESTIMATION
10m · HIGH-LEVEL DESIGN
5m · DATA MODEL & API
15m · DEEP DIVE ← KEY ★
4m · BOTTLENECKS & SCALING
1m · SUMMARY & CLOSE
The Deep Dive is where you are judged. Steps 1–4 are table stakes — everyone can draw boxes. Steps 5–6 (deep dive + scaling) are where senior candidates separate themselves. Protect those 15 minutes fiercely. If you're still doing requirements at minute 10, cut it short and move on.
Capacity Estimation Cheat Sheet
The numbers you need to have memorized before any interview
CONVERSION | SHORTCUT | EXAMPLE
1M req/day → QPS | ≈ 12 QPS sustained; 36 QPS peak | Instagram: 100M req/day = 1,200 QPS sustained
1B req/day → QPS | ≈ 12,000 QPS sustained; 36,000 peak | Twitter read traffic ~35K QPS
1 day | ≈ 86,400 s (use 100K for rough math) | Never say "there are 1,000 minutes in a day" (a day has 1,440)
1M users × 1KB | = 1 GB | 100M users × 500 bytes = 50 GB
1B users × 1KB | = 1 TB | YouTube metadata: 5B videos × 1KB = 5 TB
1 photo avg | = 1 MB (thumbnail: 50KB) | Instagram: 100M uploads/day = 100 TB/day
1 video avg | = 50–500 MB | YouTube: 500 hr/min uploaded = ~90 TB/day
1 tweet/message | = 140 bytes – 1 KB | Twitter: 500M tweets/day = ~70 GB/day
Worked example: Twitter-scale estimation out loud · TECHNIQUE
// Question: Design Twitter. "Let me estimate scale first."

DAU: 300M users
Tweets written: each user tweets 0.5×/day avg → 150M tweets/day
Reads: 100:1 ratio → 15B reads/day → 15B/86400 ≈ 180K read QPS
Writes: 150M/86400 ≈ 1,750 write QPS ≈ 2K write QPS

Storage per tweet: content 140B + metadata 100B + indices ~300B ≈ 550 bytes
Daily storage: 150M × 550B = 82 GB/day
10 years: 82 × 365 × 10 ≈ 300 TB (just tweet text, no media)

Cache for hot tweets: 80% of reads hit 20% of data (Pareto)
Hot set: cache ~20% of daily tweet data = 20% × 82 GB/day ≈ 16 GB hot set

// Now I know: I need a system handling 180K reads/sec, 2K writes/sec,
// ~80 GB/day new storage, ~16 GB hot cache. This drives my design choices.
Communication Patterns
What interviewers actually listen for — and what signals seniority
❌ JUNIOR PATTERN
"I'll use Kafka."

"Redis is the best option here."

"MySQL for the database."

[silence — drawing without explanation]

[waits for interviewer to ask about failures]
✓ SENIOR PATTERN
"For 50K write QPS with at-least-once delivery, Kafka fits — though it adds operational complexity."

"Redis solves the hot-read problem here. The trade-off is cache invalidation complexity and sizing."

"For this read:write ratio and need for flexible queries, PostgreSQL — we can shard by user_id later."

"I'm adding a cache here because the read path is 100:1 over writes and most reads are for recent data..."

"Let me think about failure modes. If this service goes down, I want the write path to still work..."
The senior-signal pattern: Before committing to any technology, say: "I see two approaches here — [A] and [B]. [A] gives us [benefit] but costs [trade-off]. [B] is simpler but doesn't handle [edge case]. Given [constraint from requirements], I'll go with [A]." This shows that you considered alternatives — which is what senior engineers actually do.
7 Common Mistakes
Every one of these has caused otherwise-qualified candidates to fail
1
JUMPING TO SOLUTIONS WITHOUT REQUIREMENTS
Fix: Force yourself to spend 5 full minutes on requirements. Repeat back: "So I'm building a system for X users, Y QPS, with Z consistency requirement — is that correct?" Get explicit confirmation before drawing anything.
2
DESIGNING FOR ONE SERVER
Fix: Always think distributed by default. Even for "simple" systems, ask about scale first. A system with 10K QPS requires load balancing, connection pooling, and at least 2 servers. Assume you need to scale.
3
AVOIDING TRADE-OFFS — "THIS SOLUTION HANDLES EVERYTHING"
Fix: Every technology choice has costs. If you pick Kafka: mention the latency overhead, operational complexity, at-least-once semantics. If you pick Cassandra: mention you can't do JOINs and that consistency is tunable, not strong by default. Acknowledging trade-offs shows maturity.
4
NOT KNOWING THE NUMBERS
Fix: Memorize the estimation table. Doing math out loud (even approximations) shows discipline. "1B req/day ÷ 100K seconds = 10K QPS × 3 peak = 30K peak QPS" said in 10 seconds is far more impressive than "should be fine."
5
DESIGNING ALONE, NOT COLLABORATING
Fix: Check in every 5–8 minutes. "I'm thinking of separating read and write paths here — does that seem like the right direction to explore?" Interviewing is a collaborative exercise. Interviewers want to see how you work with others.
6
SPENDING 20 MINUTES ON CRUD
Fix: Cover the obvious parts quickly: "Standard REST CRUD, JWT auth, HTTPS everywhere — 2 minutes." Save your time for the hard distributed systems problems: fan-out, consistency, failure handling. That's where the interview is won or lost.
7
NO FAILURE HANDLING IN THE DESIGN
Fix: Proactively discuss failures. "What if the cache is unavailable? I'd add a circuit breaker and fall back to DB reads." "What if the notification service is slow? I'd make it async with a queue and retry." Cover the top 2–3 failure scenarios before being asked.
6 Mock Interview Problems
One per day — timed, 45 minutes, no notes until after
MOCK 1 · DAY 1 · WARMUP
Design Pastebin / URL Shortener
EASY
1M pastes/day · 100M reads/day · max 10 MB paste
Key decisions: ID generation (base62), text in S3 vs DB, expiry strategy, CDN for large pastes, async analytics.
Deep dive: cache tier for popular pastes, ID collision handling, lazy vs background expiry cleanup
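One way to sketch the base62 ID generation in Python: encode a sequential (or snowflake-style) integer, which also sidesteps collisions entirely. The alphabet ordering and 7-character target are illustrative choices, not fixed requirements.

import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols

def base62_encode(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

# 7 chars cover 62**7 ≈ 3.5 trillion IDs (millennia of pastes at 1M/day)
print(base62_encode(125))   # "21"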
MOCK 2 · DAY 2 · COMPONENT
Design a Notification System
MEDIUM
10M notifications/day · push + email + SMS · <30s delivery
Key decisions: channel routing (FCM/APNs/SES/Twilio), priority queues, user preferences, retry + fallback logic, deduplication.
Deep dive: retry/fallback when push fails → SMS, dedup across channels, delivery receipts
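A hedged sketch of the retry-then-fallback routing in Python; send_push and send_sms are placeholders standing in for real FCM/Twilio calls, and dedup would sit in front of this, keyed by notification ID.

import time

def send_push(user_id: str, msg: str) -> None:
    raise TimeoutError("push provider timed out")   # simulate a failing channel

def send_sms(user_id: str, msg: str) -> None:
    print(f"SMS to {user_id}: {msg}")               # stand-in for a Twilio call

def send_with_fallback(user_id: str, msg: str, channels, retries: int = 2) -> bool:
    for send in channels:                  # channels ordered by user preference
        for attempt in range(retries):
            try:
                send(user_id, msg)
                return True                # delivered; stop trying
            except Exception:
                time.sleep(2 ** attempt)   # simple exponential backoff
    return False                           # exhausted: park in a dead-letter queue

send_with_fallback("u42", "Your order shipped", [send_push, send_sms])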
MOCK 3 · DAY 3 · DISTRIBUTED
Distributed Job Scheduler
MEDIUM-HARD
1M jobs · 1K jobs/sec peak triggering · at-most-once execution
Key decisions: storage (DB partitioned by scheduled_time), time-wheel vs priority queue, leader election, enforcing at-most-once execution, dead job recovery.
Deep dive: preventing two nodes from running same job, crash recovery, cron expression parsing
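A sketch of the "only one node runs the job" guarantee using an atomic conditional UPDATE on a lease column. The jobs schema and column names are assumptions for illustration; because leases expire, jobs owned by crashed nodes become reclaimable, which covers crash recovery too.

import sqlite3, time, uuid

NODE_ID = str(uuid.uuid4())
LEASE_SECONDS = 60

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, owner TEXT, lease_until REAL)")
db.execute("INSERT INTO jobs (id) VALUES (1)")

def try_claim(job_id: int) -> bool:
    now = time.time()
    cur = db.execute(
        """UPDATE jobs SET owner = ?, lease_until = ?
           WHERE id = ? AND (owner IS NULL OR lease_until < ?)""",
        (NODE_ID, now + LEASE_SECONDS, job_id, now),
    )
    db.commit()
    return cur.rowcount == 1   # at most one node wins the claim

print(try_claim(1))   # True on the winning node; False everywhere else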
MOCK 4 · DAY 4 · HARD
Design Google Drive / Dropbox
HARD
50M DAU · 100M uploads/day · avg 500KB · 10 PB total
Key decisions: chunked upload (4MB chunks, SHA-256 dedup), delta sync (changed chunks only), metadata DB + S3, sync protocol, conflict resolution.
Deep dive: chunk deduplication across users, delta sync algorithm, last-write-wins vs OT conflict resolution
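A sketch of the chunk-level dedup in Python: fixed 4 MB chunks keyed by SHA-256, so identical chunks (even across users) are stored once. The dict stands in for the S3 chunk store, and real Dropbox-style sync uses more sophisticated chunking; this only shows the idea.

import hashlib

CHUNK_SIZE = 4 * 1024 * 1024          # 4 MB chunks, per the key decisions above
chunk_store: dict[str, bytes] = {}    # content hash -> chunk bytes

def upload(path: str) -> list[str]:
    manifest = []                     # ordered hashes: the file's metadata record
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in chunk_store:   # dedup: upload only unseen chunks
                chunk_store[digest] = chunk
            manifest.append(digest)
    return manifest

Delta sync then falls out: re-chunk the changed file, diff its manifest against the stored one, and upload only the new hashes.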
MOCK 5 · DAY 5 · HARD
Live Streaming Platform
HARD
1K streamers · 10M viewers · 100K viewers/top stream · <10s latency
Key decisions: RTMP ingest → HLS transcode, CDN fan-out, WebSocket chat, viewer count (HyperLogLog), HLS vs WebRTC latency trade-off.
Deep dive: transcoding pipeline parallelism, chat fan-out at 100K viewers, approximate viewer count
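The approximate viewer count maps directly onto Redis HyperLogLogs (PFADD/PFCOUNT): roughly 0.8% error in about 12 KB per stream, regardless of audience size. A sketch assuming a local Redis and the redis-py client; the key naming is made up.

import redis

r = redis.Redis()

def record_view(stream_id: str, viewer_id: str) -> None:
    r.pfadd(f"viewers:{stream_id}", viewer_id)    # idempotent per unique viewer

def viewer_count(stream_id: str) -> int:
    return r.pfcount(f"viewers:{stream_id}")      # approximate distinct count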
MOCK 6 · DAY 6 · SYNTHESIS
Search Autocomplete
MEDIUM
10K autocomplete QPS · 100M unique queries/day in logs · top-10 suggestions
Key decisions: trie vs inverted index, precompute top-N per prefix, daily batch update pipeline from logs, shard trie by prefix range.
Deep dive: trie sharding, pre-computation vs on-the-fly, unicode + multilingual handling
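A sketch of the precompute idea using a flat prefix-to-top-k map instead of a linked trie (same contract: a read becomes a single lookup with no traversal at query time). The query strings and counts are toy values.

from collections import defaultdict
import heapq

top_k: dict[str, list[tuple[int, str]]] = defaultdict(list)   # prefix -> min-heap

def build(query_counts: dict[str, int], k: int = 10) -> None:
    for query, count in query_counts.items():
        for i in range(1, len(query) + 1):        # every prefix of the query
            heap = top_k[query[:i]]
            heapq.heappush(heap, (count, query))
            if len(heap) > k:
                heapq.heappop(heap)               # evict the least popular

def suggest(prefix: str) -> list[str]:
    return [q for _, q in sorted(top_k.get(prefix, []), reverse=True)]

build({"system design": 900, "systemd": 500, "sysadmin": 300})
print(suggest("sys"))   # ['system design', 'systemd', 'sysadmin']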
Practice protocol: Set a 45-minute timer. Draw on paper or a whiteboard. No notes. After time is up, review against the module notes and identify the 2–3 things you missed. Do NOT review the answer before attempting — the discomfort of not knowing is the practice.
Quick Answer Cheat Sheet
Interviewer probes — have these answers ready in 30 seconds
"How do you handle hot partitions / hot keys?"
Add a random suffix (0–9) to the partition key to spread load across 10× more partitions. Combine results at read time. Alternatively, cache hot keys separately in Redis with a short TTL. For write-heavy hot keys, use a write-behind cache. Consistent hashing with virtual nodes also mitigates hot spots by distributing keys more evenly.
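A sketch of the random-suffix technique for a hot counter, assuming redis-py: writes spread across 10 sub-keys, reads fan out and merge.

import random
import redis

r = redis.Redis()
SHARDS = 10

def incr_hot(key: str) -> None:
    r.incr(f"{key}:{random.randrange(SHARDS)}")   # spread writes over 10 sub-keys

def read_hot(key: str) -> int:
    return sum(int(r.get(f"{key}:{i}") or 0) for i in range(SHARDS))  # merge on read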
"How do you prevent thundering herd on cache expiry?"
Three approaches: (1) Probabilistic early expiration — with probability proportional to time-to-expiry, refresh early before expiry hits all at once. (2) Mutex/lock on cache miss — first thread refreshes, others wait. Use Redis SETNX as a distributed lock with short TTL. (3) Background refresh — proactively refresh popular keys before expiry, keeping them always warm.
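A sketch of approach (2), mutex-on-miss, assuming redis-py. recompute is a placeholder for the real DB query, and the 10-second lock TTL bounds the damage if the refreshing process dies mid-refresh.

import time
import redis

r = redis.Redis()

def get_with_lock(key: str, recompute, ttl: int = 300):
    val = r.get(key)
    if val is not None:
        return val                                    # cache hit
    if r.set(f"lock:{key}", "1", nx=True, ex=10):     # SETNX with a short lock TTL
        val = recompute()
        r.set(key, val, ex=ttl)                       # repopulate the cache
        r.delete(f"lock:{key}")
        return val
    time.sleep(0.05)                                  # lost the race: brief wait
    return get_with_lock(key, recompute, ttl)         # then re-read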
"How do you achieve exactly-once processing in Kafka?"
Kafka delivers at-least-once by default. To achieve effectively-once: use idempotent consumers — check an idempotency key in the DB before processing, and store it atomically with the result. For stricter needs, Kafka Transactions (EOS — exactly-once semantics) enable atomic produce+consume operations. The outbox pattern (B11) combined with idempotent consumers gives effectively-exactly-once end-to-end.
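A sketch of the idempotent-consumer check: the idempotency key and the business result commit in one transaction, so a redelivered message becomes a no-op. Table names are assumptions, and sqlite3 stands in for the real DB.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE processed (message_id TEXT PRIMARY KEY);
    CREATE TABLE results (message_id TEXT, output TEXT);
""")

def handle(message_id: str, payload: str) -> None:
    try:
        with db:   # key insert + result commit atomically, or roll back together
            db.execute("INSERT INTO processed (message_id) VALUES (?)", (message_id,))
            db.execute("INSERT INTO results (message_id, output) VALUES (?, ?)",
                       (message_id, payload.upper()))   # stand-in business logic
    except sqlite3.IntegrityError:
        pass       # redelivery: message_id already processed, safely skip

handle("msg-1", "hello")
handle("msg-1", "hello")   # duplicate delivery is a no-op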
"How do you handle cascading failures between services?"
Circuit breaker pattern: after N consecutive failures to a downstream service, the circuit "opens" — requests fail fast without hitting the service. After a timeout, a "half-open" probe is sent; if it succeeds, circuit closes. Bulkhead pattern: isolate thread pools per downstream service so one slow service doesn't exhaust the shared pool. Timeout + retry with exponential backoff for transient failures. Fallback: cached result, degraded response, or graceful error.
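A minimal circuit-breaker sketch in Python; the threshold and cooldown are illustrative, and a production breaker would also limit how many concurrent calls get through in the half-open state.

import time

class CircuitBreaker:
    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            # cooldown elapsed: half-open, let this call through as a probe
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.time()   # trip (or re-trip) the breaker
            raise
        self.failures, self.opened_at = 0, None   # success closes the circuit
        return result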
"How would you design for multi-region?"
Active-active: both regions serve reads and writes, data replicated asynchronously (DynamoDB Global Tables, CockroachDB). Conflict resolution needed. Active-passive: primary region handles writes, secondary is hot standby — failover when primary goes down. Latency-based routing (Route 53) sends users to nearest region. GDPR: EU user data must stay in EU region — use regional data classification. RPO/RTO: async replication has seconds of potential data loss (RPO); failover automation targets minutes (RTO).
"How do you do zero-downtime schema migrations?"
Expand-contract (also called parallel change): Step 1 — Add new column (nullable, no default); existing code ignores it. Step 2 — Dual-write: new code writes to both old and new columns. Step 3 — Backfill: migrate old data to new column via background job. Step 4 — Switch reads: code now reads from new column. Step 5 — Stop writing to old column. Step 6 — Drop old column. Never run a blocking ALTER TABLE on a large live table without this — many schema changes take an exclusive lock and block all queries.
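A sketch of the dual-write step (step 2): new code writes both columns so legacy readers stay correct while the backfill runs. The users schema and the full_name/display_name column names are hypothetical.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT, display_name TEXT)")
db.execute("INSERT INTO users (id, full_name) VALUES (1, 'Ada')")

def save_user(user_id: int, name: str) -> None:
    db.execute(
        """UPDATE users
           SET full_name = ?,       -- old column: legacy readers still depend on it
               display_name = ?     -- new column: reads switch here in step 4
           WHERE id = ?""",
        (name, name, user_id),
    )
    db.commit()

save_user(1, "Ada Lovelace")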
"How do you ensure high availability for stateful services?"
Run multiple instances behind a load balancer. For sessions: externalize state to Redis (stateless app servers). For leader election (e.g., Saga Orchestrator): use Zookeeper ephemeral nodes or etcd leases — leader holds a lease, followers compete to acquire it on expiry. Health checks: remove unhealthy instances from rotation within 10–30 seconds. DB: primary-replica with automatic failover (RDS Multi-AZ, Patroni for Postgres).
MODULE B12 · INTERVIEW FRAMEWORK · COMPLETION CHECKLIST
7-step framework memorized — can recite steps + time allocations without notes
6 requirements questions to always ask (DAU, ratio, latency, consistency, geo, retention)
Capacity math: 1B/day = 12K QPS, 1M users × 1KB = 1GB, 1 photo = 1MB
Communication: state reasoning before answer, proactive trade-offs, drive conversation
7 mistakes internalized — know what symptom looks like and how to avoid it
Quick answers: hot partitions, thundering herd, exactly-once, cascading failures, multi-region
✏️ Mock 1 (Pastebin) — 45-min timed session completed
✏️ Mock 2 (Notifications) — 45-min timed session completed
✏️ Mock 3 (Job Scheduler) — 45-min timed session completed
✏️ Mock 4 (Google Drive) — 45-min timed session completed
✏️ Mock 5 (Live Streaming) — 45-min timed session completed
✏️ Mock 6 (Autocomplete) — 45-min timed session completed
All 6 mocks reviewed against module notes — gaps identified and studied
TRACK B — COMPLETE
Track B: HLD Mastered
B1 Fundamentals · B2 Databases · B3 Caching · B4 Message Queues
B5 URL Shortener · B6 Twitter Feed · B7 WhatsApp · B8 YouTube
B9 Rate Limiter · B10 Consistent Hashing · B11 Distributed Tx · B12 Interview Framework
NEXT: Track A (LLD) Complete · Track B (HLD) Complete · Ready for FAANG Interviews