Track B · HLD Case Study · Module B5 · Week 15

URL
Shortener

END-TO-END HLD CASE STUDY
300M URLS · 100:1 READ:WRITE · <10ms REDIRECT
3.5T
UNIQUE CODES
150GB
STORAGE
99%
CACHE HIT
B5
MODULE
POST /shorten
GET /{code} → 302
base62 encoding
Redis cache
Kafka analytics
Rate limiting
Expiry / TTL
Requirements
Functional + Non-Functional — establish scope before drawing any boxes
Functional Requirements
POST /shorten → given long URL, return short code
GET /{code} → redirect to original URL (302)
Custom alias → user specifies their own code
TTL / expiry → optional expiration date per URL
Analytics → click count, referrer, geo (optional)

OUT OF SCOPE: user accounts, dashboard, link editing
Non-Functional Requirements
300M URLs stored total
Write: 100 new URLs/sec (peak: 500/sec)
Read: 10,000 redirects/sec (peak: 50,000/sec)
Read:write ratio = 100:1
Redirect p99 latency < 10ms
Shorten p99 latency < 100ms
Availability: 99.99% (52 min downtime/year)
Durability: URLs must never be lost
Key insight to say aloud: "The 100:1 read:write ratio tells me this is a read-heavy system. My entire architecture will be optimised for fast redirects — nearly every redirect should be served from cache without touching the database."
Capacity Estimation
Numbers first — they drive every architecture decision
METRIC              | VALUE       | CALCULATION
Write QPS (avg)     | 100/sec     | given requirement
Write QPS (peak)    | 500/sec     | 5× average peak factor
Read QPS (avg)      | 10,000/sec  | 100:1 ratio × 100 writes/sec
Read QPS (peak)     | 50,000/sec  | 5× average peak factor
Storage per URL     | ~500 bytes  | 7-char code + avg long URL + metadata (URLs capped at 2048 chars)
Total storage       | 150 GB      | 300M × 500 bytes
Storage/year        | ~1.5 TB     | 100 URLs/sec × 86,400 × 365 × 500 B
Short code length   | 7 chars     | 62^7 ≈ 3.5 trillion unique codes
Hot cache size      | ~30 GB      | 20% of 300M URLs × 500 B (80/20 rule)
App servers needed  | 5–10        | 50K peak QPS ÷ 10K QPS per server
Key insight to say aloud: "150 GB fits comfortably on a single DB node — no sharding needed. The hot set (30 GB) fits in a Redis cluster. At this scale, the bottleneck is read latency, not storage — hence the 2-layer cache strategy."
High-Level Architecture
Two distinct paths — read (cache-first) and write (ID generation)
SYSTEM ARCHITECTURE — URL SHORTENER
Client
──→
Load Balancer
L7 / HTTPS / SSL termination
──→
API Servers
stateless × 5–10
                                                         ↙ READ PATH       ↘ WRITE PATH
In-Process Cache
Caffeine · 10K items · 5min TTL
miss↓
ID Generator
auto-increment + base62
Redis Cluster
url:{code} → long_url · LRU · 24h TTL
miss↓
MySQL Primary
durable write · ACID
MySQL Replicas ×3
read traffic · async replication
async↓
Kafka: click-events
fire-and-forget analytics
Read Path (redirect — must be <10ms)
Step-by-step redirect flow — FLOW
GET /abc1234

1. Check in-process cache (Caffeine): ~100ns
   HIT (60%): return HTTP 302 immediately ✅

2. Check Redis cache: ~0.5ms
   HIT (39%): populate L1 cache, return HTTP 302 ✅

3. Cache MISS (1%): query MySQL read replica: ~5ms
   Populate Redis (TTL 24h), populate L1 (TTL 5min), return HTTP 302 ✅

4. Async (does NOT block redirect):
   Publish click event to Kafka → Analytics consumer updates click_count
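
A minimal sketch of this flow as an HTTP handler, assuming a Spring-style controller: the lookup method is the 2-layer cache read shown in the Caching Strategy section, and the Kafka publish is fire-and-forget so it never delays the 302. Class and bean names here are illustrative assumptions, not part of the design above.

Redirect handler sketch (Spring-style, assumed) — JAVA
import java.net.URI;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.web.bind.annotation.*;

@RestController
public class RedirectController {

    /** Assumed lookup service wrapping the L1 → Redis → MySQL-replica read path. */
    interface UrlService { String getLongUrl(String code); }

    private final UrlService urlService;
    private final KafkaTemplate<String, String> kafka;   // assumed async analytics producer

    public RedirectController(UrlService urlService, KafkaTemplate<String, String> kafka) {
        this.urlService = urlService;
        this.kafka = kafka;
    }

    @GetMapping("/{code}")
    public ResponseEntity<Void> redirect(@PathVariable String code) {
        String longUrl = urlService.getLongUrl(code);     // steps 1–3 above
        if (longUrl == null) {
            return ResponseEntity.notFound().build();     // unknown or expired code
        }
        kafka.send("click-events", code, code);            // step 4: fire-and-forget (payload simplified)
        return ResponseEntity.status(HttpStatus.FOUND)     // 302 — see the 301 vs 302 trade-off below
                             .location(URI.create(longUrl))
                             .build();
    }
}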
Short Code Generation
Four strategies — trade-offs in uniqueness, predictability, and complexity
A: MD5 + Truncation
Hash the long URL → take first 7 chars of base62-encoded hash. Same URL always produces same code (natural dedup).
✓ Deterministic (dedup built-in)
✓ Stateless, no DB for ID
✗ Collision possible at 300M URLs
✗ Must detect + retry on collision
✗ Custom alias not supported
B: Auto-Increment + Base62 ★
DB auto-increment ID → encode to base62. ID 123456789 → "8m0Kx". Monotonic → guaranteed unique. Simple.
✓ Guaranteed unique
✓ Simple implementation
✓ Short codes for small IDs
✗ Predictable / enumerable
✗ DB is ID single point of failure
✗ Reveals URL count to scrapers
C: Snowflake-style Distributed ID
64-bit: [41-bit timestamp][10-bit machine][12-bit seq] → base62 encode. ~4096 IDs/ms per machine. No DB dependency (see the sketch after strategy D).
✓ Globally unique, no DB
✓ Scales to any throughput
✓ Sortable by creation time
✗ Requires machine ID management
✗ Clock skew issues
✗ More complex
D: Pre-generated Pool
Background job pre-generates random 7-char codes, stores as "available". API fetches and marks "used" atomically.
✓ Random → not enumerable
✓ No collision risk
✓ Fast fetch (no computation)
✗ Complex pre-generation logic
✗ Pool management overhead
✗ Wasted codes if URLs deleted
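
Strategy C's bit layout is easy to get wrong under pressure, so here is a hedged sketch assuming a fixed custom epoch and an externally assigned machine ID; clock-skew handling and ZooKeeper-style ID assignment are omitted. The resulting long is then base62-encoded exactly as in the snippet below.

Snowflake-style ID generator sketch — JAVA
// 64-bit layout: [41-bit timestamp since custom epoch][10-bit machine ID][12-bit sequence]
public class SnowflakeIdGenerator {
    private static final long CUSTOM_EPOCH = 1704067200000L; // assumed epoch: 2024-01-01 UTC
    private final long machineId;                             // 0..1023 (10 bits), assigned externally
    private long lastTimestampMs = -1L;
    private long sequence = 0L;                                // 0..4095 (12 bits)

    public SnowflakeIdGenerator(long machineId) {
        this.machineId = machineId & 0x3FF;
    }

    public synchronized long nextId() {
        long nowMs = System.currentTimeMillis();
        if (nowMs == lastTimestampMs) {
            sequence = (sequence + 1) & 0xFFF;                 // up to 4096 IDs per ms per machine
            if (sequence == 0) {                               // window exhausted: wait for next ms
                while (nowMs <= lastTimestampMs) nowMs = System.currentTimeMillis();
            }
        } else {
            sequence = 0;
        }
        lastTimestampMs = nowMs;
        return ((nowMs - CUSTOM_EPOCH) << 22) | (machineId << 12) | sequence;
    }
}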
Base62 Implementation
Encode / decode — implement from scratch in the interview — JAVA
private static final String ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
private static final int BASE = 62;

public static String encode(long id) {
    if (id == 0) return "0";
    StringBuilder sb = new StringBuilder();
    while (id > 0) {
        sb.append(ALPHABET.charAt((int)(id % BASE)));
        id /= BASE;
    }
    return sb.reverse().toString();
}

public static long decode(String code) {
    long result = 0;
    for (char c : code.toCharArray()) {
        result = result * BASE + ALPHABET.indexOf(c);
    }
    return result;
}

// encode(0)         → "0"
// encode(62)        → "10"  (1×62 + 0 — carries like "10" in any positional base)
// encode(123456789) → "8m0Kx"
// decode(encode(n)) == n  ✅
// 62^7 = 3,521,614,606,208 ≈ 3.5 trillion unique 7-char codes
Database Schema
MySQL — ACID, B-tree indexes, read replicas for scale
TABLE: urls — MySQL / PostgreSQL

COLUMN       | TYPE         | NOTES
id           | BIGINT       | PRIMARY KEY AUTO_INCREMENT — source of short code
short_code   | VARCHAR(10)  | NOT NULL UNIQUE — the 7-char base62 string
long_url     | TEXT         | NOT NULL — original URL up to 2048 chars
user_id      | BIGINT       | NULL = anonymous; FK to users table if auth added
created_at   | TIMESTAMP    | DEFAULT CURRENT_TIMESTAMP
expires_at   | TIMESTAMP    | NULL = never expires; checked on every redirect
click_count  | BIGINT       | DEFAULT 0 — updated async via Kafka consumer
is_active    | BOOLEAN      | DEFAULT TRUE — soft delete flag
INDEXES
UNIQUE INDEX idx_short_code ON urls(short_code) ← primary lookup, O(log N)
INDEX idx_user ON urls(user_id) ← "my URLs" page
PARTIAL INDEX idx_expires ON urls(expires_at) WHERE expires_at IS NOT NULL ← cleanup job (PostgreSQL syntax; MySQL has no partial indexes — use a plain index there)
Why MySQL over NoSQL? 150GB fits on one node. Access pattern is pure primary-key lookup — no scatter-gather needed. ACID prevents duplicate short_code under concurrent inserts. Read replicas handle the 100:1 read:write ratio. NoSQL would work too (DynamoDB keyed on short_code) but adds operational overhead without benefit at this scale.
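
A hedged sketch of the write path this schema implies: insert the row so MySQL assigns the auto-increment id, base62-encode that id, then store the resulting code (strategy B). Repository method names are assumptions for illustration, and Base62.encode is the encode method from the Base62 Implementation section above, assumed here to live in a Base62 class.

Shorten (write path) sketch — JAVA
// POST /shorten: auto-increment id → base62 → short_code.
public class ShortenService {

    /** Assumed DAO over the `urls` table. */
    interface UrlRepository {
        long insertLongUrl(String longUrl);              // returns the generated id
        void updateShortCode(long id, String shortCode);
    }

    private final UrlRepository urls;

    public ShortenService(UrlRepository urls) {
        this.urls = urls;
    }

    public String shorten(String longUrl) {
        long id = urls.insertLongUrl(longUrl);           // 1. durable ACID insert; id assigned by MySQL
        String code = Base62.encode(id);                 // 2. unique because the id is unique
        urls.updateShortCode(id, code);                  // 3. UNIQUE(short_code) is the safety net
        return code;
    }
}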
Caching Strategy
Two layers — achieving 99%+ cache hit rate for redirects
In-Process (L1)
Caffeine cache in JVM heap. Top 10K URLs per server. Zero network. Sub-microsecond lookup.
~100 ns
~60% hit rate
Redis (L2)
Shared across all app servers. Key: "url:{code}" → long_url. 30 GB hot set. allkeys-lru eviction.
~0.5 ms
~39% hit rate
MySQL Replica
Only 1% of traffic reaches DB. SELECT long_url FROM urls WHERE short_code = ? Populate both cache layers on miss.
~5 ms
~1% (cache miss)
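
The lookup code below assumes an l1Cache handle already exists. A minimal Caffeine construction matching the limits quoted above (10K entries, 5-minute TTL) could look like this; maximumSize and expireAfterWrite are standard Caffeine builder methods.

L1 cache construction sketch — JAVA
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

Cache<String, String> l1Cache = Caffeine.newBuilder()
        .maximumSize(10_000)                         // roughly the top 10K hot codes per server
        .expireAfterWrite(Duration.ofMinutes(5))     // bounds staleness after deletes/updates
        .build();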
Cache read + invalidation pattern — JAVA
public String getLongUrl(String shortCode) {
    // L1: in-process
    String url = l1Cache.getIfPresent(shortCode);
    if (url != null) return url;

    // L2: Redis
    url = redis.get("url:" + shortCode);
    if (url != null) {
        l1Cache.put(shortCode, url);    // populate L1
        return url;
    }

    // DB: only 1% of traffic
    UrlRecord rec = db.findByShortCode(shortCode);
    if (rec == null || rec.isExpired()) return null;
    redis.setex("url:" + shortCode, 86400, rec.longUrl); // 24h TTL
    l1Cache.put(shortCode, rec.longUrl);
    return rec.longUrl;
}

// Invalidation: on URL delete
public void deleteUrl(String shortCode) {
    db.softDelete(shortCode);
    redis.del("url:" + shortCode);
    l1Cache.invalidate(shortCode);  // local eviction only — other instances need a broadcast (see sketch below)
}
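
The final comment above only evicts the local L1; other app servers keep serving the stale entry until their 5-minute TTL expires. One hedged way to close that gap is a Redis pub/sub broadcast — the channel name and the Jedis-style client wiring here are assumptions, not part of the design above.

Cross-instance L1 invalidation via pub/sub sketch — JAVA
// Publisher side — call after the local invalidate in deleteUrl():
public void broadcastInvalidate(String shortCode) {
    redis.publish("l1-invalidate", shortCode);       // assumed channel name
}

// Subscriber side — needs a dedicated connection, because subscribe() blocks it:
public void startInvalidationListener(redis.clients.jedis.Jedis subscriberClient) {
    new Thread(() ->
        subscriberClient.subscribe(new redis.clients.jedis.JedisPubSub() {
            @Override
            public void onMessage(String channel, String shortCode) {
                l1Cache.invalidate(shortCode);       // evict the stale entry locally
            }
        }, "l1-invalidate")
    ).start();
}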
301 vs 302 Redirect
An explicit trade-off — analytics vs server load
301
MOVED PERMANENTLY
Browser caches the redirect. Subsequent visits from same browser skip your server entirely.
✓ Browser caches → lower server load
✓ Reduced latency for repeat visits
✓ CDN can cache the response
✗ Can't track repeat clicks
✗ Can't update long URL after caching
✗ Analytics are incomplete
302
FOUND (TEMPORARY) — RECOMMENDED ★
Browser does NOT cache. Every click hits your server. Full analytics visibility.
✓ Full click analytics (every visit)
✓ Can update long URL at any time
✓ A/B testing possible
✗ Higher server load (every click)
✗ Slightly higher latency
✗ No browser-level caching
Interview answer: "I'll use 302 by default because analytics are a core requirement. If a link creator wants maximum performance and doesn't need analytics, we can expose a '301 mode' as an opt-in. The trade-off is explicit and user-controlled."
Edge Cases & Deep Dives
What separates a good answer from a great one
URL Expiry
URLs with expires_at set should return 410 Gone after expiry.
Lazy check: on redirect, if expires_at < NOW() → 410.
Background job: daily DELETE WHERE expires_at < NOW().
Redis TTL: set same TTL as URL expiry on cache key.
Custom Aliases
User requests /my-campaign. Must be unique, safe, and not clash with system routes.
UNIQUE constraint handles conflict → 409 response.
Blocklist: reserve "api", "health", "admin".
Regex validate: alphanumeric + hyphens only.
Rate Limiting
Prevent URL bombing and redirect abuse from single IPs.
Redis counter: INCR rate:shorten:{ip}:{hour}
EXPIRE 3600 → if >100 → 429 Too Many Requests.
Separate limits for /shorten vs GET /{code} — see the sketch at the end of this section.
Viral URL (Hotkey)
Taylor Swift tweets /abc1234 → 1M req/sec hits one URL.
L1 in-process cache absorbs burst (no Redis roundtrip).
If insufficient: replicate key to N Redis shards.
CDN 301 caching as last resort (lose analytics).
URL Validation
Malicious URLs, redirect chains, oversized inputs.
Parse URL (valid http/https scheme).
Check against phishing/malware blocklist.
Detect redirect loops (sho.rt → sho.rt).
Max long URL: 2048 chars.
Analytics Lag
Kafka consumer falls behind → click_count is stale.
click_count is "approximate" by design — set user expectations accordingly.
Monitor consumer lag: alert if >100K backlog.
Scale consumer group (add instances up to numPartitions).
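
A minimal sketch of the fixed-window limiter from the Rate Limiting card above, using Redis INCR + EXPIRE. The key format and the 100-per-hour limit follow the card; the Jedis-style client is an assumption.

Rate limiter sketch (fixed window) — JAVA
// At most 100 POST /shorten requests per IP per hour; key: rate:shorten:{ip}:{hour}
public boolean allowShorten(String ip) {
    long hourBucket = System.currentTimeMillis() / 3_600_000L;   // current hour window
    String key = "rate:shorten:" + ip + ":" + hourBucket;
    long count = redis.incr(key);                 // atomic; first call creates the key at 1
    if (count == 1) {
        redis.expire(key, 3600);                  // first hit in the window sets the TTL
    }
    return count <= 100;                          // false → respond 429 Too Many Requests
}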
7-Step Framework Applied
How to present URL Shortener in 45 minutes in a FAANG interview
01
REQUIREMENTS (5m)
Shorten + redirect. 300M URLs. 100:1 read:write. <10ms redirect. 99.99% uptime. Analytics optional.
02
ESTIMATION (5m)
150GB storage. 50K peak QPS. 30GB hot set. 3.5T unique codes. No sharding needed at this scale.
03
HLD (10m)
Client→LB→API servers→[L1 cache→Redis→MySQL replica]. Write: API→MySQL primary (auto-ID→base62). Analytics: Kafka async.
04
DEEP DIVE (15m)
Code generation (base62 + auto-increment). 2-layer cache (L1+Redis). Schema with partial index. 302 vs 301 trade-off.
05
BOTTLENECKS (5m)
Cache hit rate (99%+ target). Hotkeys (viral URLs → L1 absorbs). ID generation (MySQL SPOF → use Snowflake at 10× scale).
06
FAILURES (3m)
Redis down: fall through to DB (~5ms, acceptable). DB primary down: serve from cache + replicas. Analytics lag: non-blocking.
07
SCALE UP (2m)
10× URLs → shard by hash(short_code). Multi-region → geo DNS + regional MySQL + cross-region replication. CDN for 301 redirects.
Numbers to say aloud: "150 GB fits on a single node — no sharding." · "62^7 = 3.5 trillion codes — won't exhaust for centuries." · "99% cache hit rate means 1% of redirects touch MySQL." · "302 because analytics is a requirement, with a 301 opt-in for CDN offload."
01
Implement Base62 Encode / Decode
~1 hr

Write base62 encode and decode in Java (or your language of choice):

  • Handle edge cases: encode(0), encode(1), encode(62), encode(62^7 - 1)
  • Verify roundtrip: assert decode(encode(n)) == n for 1000 random values
  • How many chars does encode(300_000_000) produce?
  • Extend: implement MD5 + truncation approach. Write the collision detection retry loop.
02
Extended Schema Design
~1.5 hrs

Extend the base schema to support:

  1. User accounts owning URLs — with "my URLs" listing page (paginated, sorted by created_at)
  2. URL groups / campaigns — tag multiple URLs together (e.g., "summer-sale")
  3. Per-URL hourly analytics — click count broken down by hour-of-day
  4. A/B testing — one short code routes 50% to URL-A, 50% to URL-B

For each: write the CREATE TABLE SQL, necessary indexes, and explain the query pattern.

03
Failure Scenario Design
~1.5 hrs

Design the failure handling for each scenario. What does the system do? What does the user experience?

  1. Redis cluster completely unavailable
  2. MySQL primary goes down mid-write (during shorten request)
  3. ID generator service (if using external Snowflake service) is unreachable
  4. Analytics Kafka topic has 5-minute lag — consumer backlog of 3M events
  5. A single URL (/abc1234) receives 1M requests/sec (DDoS via viral celebrity tweet)
04
Multi-Region Redesign (US + EU + APAC)
~3 hrs · full redesign

Redesign for global deployment with these constraints:

  • Users in each region get <5ms redirect latency
  • URLs created in US accessible from EU within 1 second
  • Unified analytics across all regions
  • No EU user data may leave EU (GDPR compliance)
  • System remains available if one entire region goes down

Design: DNS routing strategy, data replication model, consistency choice for URL creation, analytics aggregation, GDPR compliance approach.

Self-Check — MODULE B5 · URL SHORTENER
Can state all functional + non-functional requirements from memory
Capacity estimation in <5 min: 150GB, 100:1, 30GB hot set, 3.5T codes
Can draw full architecture diagram: LB → API → L1/L2 cache → DB → Kafka
All 4 short code generation strategies and their trade-offs
Can implement base62 encode/decode from scratch in an interview
DB schema: correct columns, UNIQUE index on short_code, partial index on expires_at
2-layer cache: in-process (L1) + Redis (L2), TTLs, hit rates, invalidation
301 vs 302 trade-off: analytics vs CDN offload — say the trade-off explicitly
Async analytics: Kafka click events, fire-and-forget, doesn't block redirect
Edge cases: expiry, custom aliases, rate limiting, URL validation, hotkeys
Failure handling: Redis down → DB fallback; DB primary down → replicas
Can apply full 7-step framework to URL shortener in 45 minutes
✏️ Task 1: base62 implementation with roundtrip test
✏️ Task 2: extended schema (users, campaigns, analytics, A/B)
✏️ Task 4 (capstone): multi-region redesign with GDPR compliance
NEXT MODULE
B6 — Design Twitter/X Feed
Social graph · Fan-out on write vs fan-out on read
Home timeline generation · Celebrity problem · News feed ranking
Push vs pull model · Redis timeline cache · Hybrid approach