Module B5 — URL Shortener (End-to-End HLD Case Study)

System Design Mastery Course | Track B: HLD | Week 15


🎯 Module Overview

Duration: 1 week
Track: B — HLD Case Studies
Prerequisites: B1–B4 (all fundamentals)
Goal: Apply every HLD concept in a complete end-to-end design. The URL shortener is the most common FAANG warm-up question. Master it so well you can teach it.

Why URL Shortener?


1. Requirements Clarification

Functional Requirements

Core features:
  1. POST /shorten          — Given a long URL, return a short code (e.g. "abc1234")
  2. GET /{shortCode}       — Redirect to original long URL (HTTP 301/302)
  3. Custom aliases         — User can specify their own short code
  4. TTL / expiry           — URLs can have an optional expiration date
  5. Analytics              — Click count, referrer, geo, device (optional)

Out of scope (for this design):
  - User accounts / auth
  - Dashboard UI
  - Link editing after creation

Non-Functional Requirements

Scale:
  - 300 million URLs stored
  - Write: 100 new URLs/second (peak: 500/sec)
  - Read:  10,000 redirects/second (peak: 50,000/sec)
  - Read:write ratio = 100:1

Performance:
  - Redirect latency: p99 < 10ms
  - Shorten latency:  p99 < 100ms

Durability & Availability:
  - 99.99% availability
  - URLs must never be lost (durable writes)
  - Short codes must be globally unique

Storage:
  - Short code: 7 chars → 3.5 trillion unique codes (62^7)
  - URL record: ~500 bytes (short code + long URL + metadata)
  - Total: 300M × 500 bytes = 150 GB (fits on one DB node with room to grow)

2. Capacity Estimation

WRITE QPS:
  100 URLs/sec average → peak 500/sec
  Storage/day: 100 × 86,400 × 500 bytes = ~4.3 GB/day
  Storage/5 years: 4.3 GB × 365 × 5 = ~7.8 TB

READ QPS:
  10,000 redirects/sec average → peak 50,000/sec
  Bandwidth: 50,000 × 500 bytes = ~25 MB/sec (response)

CACHE:
  Assuming 80/20 rule: top 20% URLs get 80% of traffic
  Hot set: 300M × 0.20 × 500 bytes = ~30 GB → fits in Redis cluster
  Target: 99%+ cache hit rate for redirects

SERVERS:
  Each app server handles ~10K req/sec
  Read path: 50K req/sec → 5 servers minimum (use 10–20 with LB)
  Write path: 500 req/sec → 1 server (use 3 for HA)
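The arithmetic above can be double-checked in a few lines. A throwaway sketch; the constants mirror the estimates in this section:

```java
// CapacityEstimate.java — back-of-envelope numbers from Section 2
public class CapacityEstimate {
    static final long WRITE_QPS = 100;           // average new URLs/sec
    static final long RECORD_BYTES = 500;        // short code + long URL + metadata
    static final long TOTAL_URLS = 300_000_000L; // stored URLs

    // Bytes written per day at the average write rate
    static long bytesPerDay() { return WRITE_QPS * 86_400 * RECORD_BYTES; }

    // Five-year retention at the same rate
    static long bytesFiveYears() { return bytesPerDay() * 365 * 5; }

    // Hot set: top 20% of URLs (80/20 rule)
    static long hotSetBytes() { return (long) (TOTAL_URLS * 0.20 * RECORD_BYTES); }

    public static void main(String[] args) {
        System.out.printf("Per day:    %.1f GB%n", bytesPerDay() / 1e9);     // ~4.3 GB
        System.out.printf("Five years: %.1f TB%n", bytesFiveYears() / 1e12); // ~7.9 TB
        System.out.printf("Hot set:    %.0f GB%n", hotSetBytes() / 1e9);     // 30 GB
    }
}
```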

3. High-Level Architecture

                    ┌─────────────────────────────────────────┐
                    │                                         │
[Browser/App] ──→ [DNS] ──→ [Load Balancer (L7 / HTTPS)]     │
                                      │                       │
                    ┌─────────────────┼───────────────────┐   │
                    │                 │                   │   │
                    ↓                 ↓                   ↓   │
             [API Server 1]   [API Server 2]   [API Server N] │
                    │                                         │
          ┌─────────┼─────────────────────────────────┐       │
          │         │                                 │       │
          ↓         ↓                                 ↓       │
    [Redis Cache] [ID Generator]              [MySQL Primary] │
    (hot URLs)    (Snowflake/                 (source of truth)│
                   base62)                         │           │
                                            [MySQL Replicas]  │
                                            (read replicas)   │
                    │                                         │
                    ↓                                         │
             [Kafka: click-events]                            │
                    │                                         │
                    ↓                                         │
           [Analytics Consumer]                              │
           (ClickHouse/BigQuery)                             │
                    └─────────────────────────────────────────┘

Read Path (Redirect — must be <10ms)

1. GET /abc1234
2. Check Redis cache: GET url:abc1234
   → HIT (99% of cases): return 302 redirect immediately (~0.5ms)
   → MISS (1%): read from MySQL replica (~5ms)
3. On cache miss: populate Redis (TTL 24h), return redirect
4. Async: publish click event to Kafka (fire-and-forget, doesn't block redirect)

Write Path (Shorten URL)

1. POST /shorten { longUrl, customAlias?, expiresAt? }
2. Validate URL (parseable, not already in blocklist)
3. Generate short code (see Section 4)
4. Write to MySQL primary (durable)
5. Optionally pre-warm Redis cache
6. Return { shortUrl: "https://sho.rt/abc1234" }

4. Short Code Generation

Option A: MD5 Hash + Truncation

// java.security.MessageDigest; base62Encode over the digest bytes is an assumed helper
byte[] digest = MessageDigest.getInstance("MD5").digest(longUrl.getBytes(StandardCharsets.UTF_8));
String shortCode = base62Encode(digest).substring(0, 7);

Pros:
  ✅ Deterministic → same URL always gets the same short code (natural dedup)
  ✅ No ID generator dependency
  ✅ Stateless → any server can generate independently

Cons:
  ❌ Hash collision: two different URLs can produce the same 7-char prefix
     Pairwise probability is 1/(62^7) ≈ 1 in 3.5 trillion, but by the birthday
     paradox, 300M URLs yield on the order of 10,000 expected collisions
  ❌ Resolution needed: detect collision on insert, append a counter, rehash
  ❌ Custom aliases not supported
Option B: Auto-Increment ID + Base62 Encoding

// DB generates auto-increment ID
long id = db.insertAndGetId(longUrl);  // e.g., id = 123456789

// Encode the ID to base62
String shortCode = base62Encode(id);   // "8m0Kx" (5 chars for this ID)

Base62 alphabet: 0-9, a-z, A-Z (62 characters)
7 characters → 62^7 ≈ 3.5 trillion unique codes

base62Encode(123456789):
  123456789 % 62 = 33 → 'x' (least-significant digit)
  123456789 / 62 = 1991238 → recurse
  Result: "8m0Kx" (variable length, pad to 7 if needed)
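The encode loop above, together with its inverse, as a compact Java sketch. The alphabet ordering (0-9, a-z, A-Z) follows the text; negative IDs are out of scope:

```java
public class Base62 {
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Repeatedly take (id % 62) as the next least-significant digit
    public static String encode(long id) {
        if (id == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % 62)));
            id /= 62;
        }
        return sb.reverse().toString();   // most-significant digit first
    }

    // Horner's rule: accumulate digit values left to right
    public static long decode(String code) {
        long id = 0;
        for (char c : code.toCharArray()) {
            id = id * 62 + ALPHABET.indexOf(c);
        }
        return id;
    }

    public static void main(String[] args) {
        System.out.println(encode(123456789L));  // 8m0Kx
        System.out.println(decode("8m0Kx"));     // 123456789
    }
}
```

Task 1 below asks for exactly this pair plus edge cases; verify `decode(encode(n)) == n` for the boundary values before moving on.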

Pros:
  ✅ Guaranteed unique (monotonically increasing ID)
  ✅ Simple, fast
  ✅ Short → lower IDs = shorter codes

Cons:
  ❌ Predictable → attackers can enumerate URLs (ID 1, 2, 3...)
  ❌ Single point of failure if using one DB for ID generation

Option C: Distributed ID (Snowflake-style)

64-bit ID structure:
  [41 bits: timestamp ms] [10 bits: machine ID] [12 bits: sequence]

Generates ~4096 IDs/ms per machine. Globally unique. No DB dependency.
Encode result in base62 → 7-char code.

Use when: extremely high write throughput, multi-region deployment.
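The 41/10/12 bit layout can be packed and unpacked with shifts. A sketch of the ID structure only; a production generator also handles clock drift and per-millisecond sequence rollover:

```java
public class SnowflakeLayout {
    // [41 bits: timestamp ms][10 bits: machine ID][12 bits: sequence]
    static long pack(long timestampMs, long machineId, long sequence) {
        return (timestampMs << 22) | (machineId << 12) | sequence;
    }

    static long timestampOf(long id) { return id >>> 22; }
    static long machineOf(long id)   { return (id >>> 12) & 0x3FF; } // 10 bits
    static long sequenceOf(long id)  { return id & 0xFFF; }          // 12 bits

    public static void main(String[] args) {
        long id = pack(1_700_000_000_000L, 42, 7);
        System.out.println(timestampOf(id)); // 1700000000000
        System.out.println(machineOf(id));   // 42
        System.out.println(sequenceOf(id));  // 7
    }
}
```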

Option D: Pre-generated Code Pool

Background job pre-generates random 7-char base62 codes.
Stores in a "codes" table with status: available/used.
API server fetches next available code atomically (SELECT FOR UPDATE).

Pros:
  ✅ No collision risk
  ✅ Random → not enumerable
Cons:
  ❌ More complex, pre-generation overhead

Recommendation: Option B (auto-increment + base62) for simplicity. Add random salt to prevent enumeration if needed.


5. Database Schema

CREATE TABLE urls (
    id           BIGINT PRIMARY KEY AUTO_INCREMENT,
    short_code   VARCHAR(10) NOT NULL UNIQUE,
    long_url     TEXT NOT NULL,
    user_id      BIGINT,              -- NULL if anonymous
    created_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    expires_at   TIMESTAMP,           -- NULL = never expires
    click_count  BIGINT DEFAULT 0,    -- approximate, updated async
    is_active    BOOLEAN DEFAULT TRUE
);

CREATE UNIQUE INDEX idx_short_code ON urls(short_code);
CREATE INDEX idx_user_id ON urls(user_id);  -- "my URLs" page
CREATE INDEX idx_expires_at ON urls(expires_at) WHERE expires_at IS NOT NULL;
-- Partial index only on rows with expiry → smaller, faster for the cleanup job
-- (PostgreSQL syntax: MySQL has no partial indexes, so use a full index there)

DB Choice: MySQL (or PostgreSQL)

Justification:
  - 150 GB total storage → fits comfortably on single node
  - Access pattern: primary key lookup (short_code) → perfect for SQL B-tree index
  - ACID required: prevent duplicate short codes under concurrent inserts
  - Complex queries not needed → NoSQL would work too, but SQL is simpler

Read replicas: 3 replicas for 10K+ redirects/sec read load
Sharding: not needed at 150 GB — if URLs reach 10 TB, shard by hash(short_code)

6. Caching Strategy

Layer 1: In-process cache (Caffeine)
  - Size: top 10,000 most accessed URLs per server
  - TTL: 5 minutes
  - Hit rate: ~60% of traffic (power-law distribution)

Layer 2: Redis
  - Key: "url:{shortCode}"
  - Value: long_url string
  - TTL: 24 hours (most URLs accessed within 24h of creation)
  - Hit rate: ~99% of remaining traffic
  - Eviction: allkeys-lru (hot URLs stay, cold ones evicted)

Cache miss path:
  → Query MySQL read replica (~5ms)
  → Populate Redis (TTL 24h)
  → Populate in-process cache (TTL 5min)
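The miss path above can be sketched as a layered lookup. The `redis` and `db` maps below are in-memory stand-ins for the real stores, and the class name is hypothetical:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class LayeredLookup {
    // L1: tiny in-process LRU (stand-in for Caffeine; real cache adds a TTL)
    static final int L1_MAX = 10_000;
    final Map<String, String> l1 = new LinkedHashMap<String, String>(16, 0.75f, true) {
        protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
            return size() > L1_MAX;
        }
    };
    final Map<String, String> redis = new HashMap<>(); // stand-in for Redis
    final Map<String, String> db = new HashMap<>();    // stand-in for MySQL read replica

    String resolve(String shortCode) {
        String url = l1.get(shortCode);          // ~60% of traffic stops here
        if (url != null) return url;
        url = redis.get("url:" + shortCode);     // ~99% of the remainder
        if (url == null) {
            url = db.get(shortCode);             // full miss: read replica
            if (url == null) return null;        // unknown code → 404
            redis.put("url:" + shortCode, url);  // populate (TTL 24h in real Redis)
        }
        l1.put(shortCode, url);                  // populate L1 (TTL 5min in Caffeine)
        return url;
    }
}
```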

Cache invalidation:
  - URL deleted: delete from Redis immediately
  - URL expired: TTL-based expiry or background job deletes + Redis key deleted
  - Click count: NOT cached — updated async via Kafka consumer to MySQL

7. Redirect: 301 vs 302

HTTP 301 Moved Permanently:
  ✅ Browser caches the redirect → subsequent visits bypass your server
  ✅ Reduces server load
  ❌ Can't track repeat visits from same browser
  ❌ If long URL changes, browsers serve old cached redirect
  Use when: link never changes, analytics not critical

HTTP 302 Found (Temporary):
  ✅ Browser does NOT cache → every visit hits your server
  ✅ Full analytics on every click
  ✅ Can update long URL at any time
  ❌ Higher server load (every click hits you)
  Use when: analytics required, URLs may change (RECOMMENDED for TinyURL)

8. Analytics Pipeline

Async — does NOT block the redirect:

[Redirect handler] ──fire-and-forget──→ [Kafka topic: click-events]
                                                │
                    ┌───────────────────────────┤
                    │                           │
                    ↓                           ↓
          [Click Count Consumer]     [Analytics Consumer]
          (UPDATE urls SET           (stream to ClickHouse
           click_count += 1)         for dashboard queries:
                                     geo, device, referrer,
                                     hourly/daily aggregates)

Click event schema:
{
  shortCode: "abc1234",
  timestamp: 1700000000,
  ipAddress: "1.2.3.4",      // → geo lookup
  userAgent: "Chrome/120",   // → device type
  referrer:  "twitter.com"
}

9. Edge Cases & Deep Dives

Expiry Handling

Option A: Lazy expiry
  On redirect, check expires_at. If expired, return 404. Let Redis TTL handle cache.
  Simple — no background job needed.

Option B: Background cleanup job
  Scheduled daily: DELETE FROM urls WHERE expires_at < NOW()
  Prevents DB from filling with dead rows.
  Combined approach: lazy check + periodic cleanup.
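The lazy check from Option A is a one-liner evaluated on every redirect (`isServable` is a hypothetical name):

```java
import java.time.Instant;

public class ExpiryCheck {
    // Lazy expiry: NULL expires_at means the URL never expires
    static boolean isServable(Instant expiresAt, Instant now) {
        return expiresAt == null || now.isBefore(expiresAt);
    }
}
```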

Custom Aliases

POST /shorten { longUrl: "...", alias: "my-campaign" }

Validation:
  - Must be unique (UNIQUE index on short_code handles this)
  - Alphanumeric + hyphens only (prevent injection, reserved words)
  - Max length 50 chars
  - Reserve system words: "api", "health", "admin" → blocklist

On conflict: return HTTP 409 Conflict
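The validation rules above collapse into one check. The regex and blocklist mirror the bullets; `AliasValidator` is a hypothetical name:

```java
import java.util.Set;
import java.util.regex.Pattern;

public class AliasValidator {
    // Alphanumeric + hyphens, 1–50 chars
    private static final Pattern VALID = Pattern.compile("^[A-Za-z0-9-]{1,50}$");
    // Reserved system words that must never be claimable
    private static final Set<String> RESERVED = Set.of("api", "health", "admin");

    public static boolean isValid(String alias) {
        return alias != null
            && VALID.matcher(alias).matches()
            && !RESERVED.contains(alias.toLowerCase());
    }
}
```

Uniqueness is still enforced by the UNIQUE index on short_code: catch the constraint violation on insert and map it to HTTP 409.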

Rate Limiting

Prevent abuse (URL bombing, scraping):
  - 100 shorten requests per IP per hour
  - 1000 redirect requests per IP per minute
  Implementation: Redis counter with TTL
    INCR rate:shorten:{ip}:{hour}
    If INCR returned 1 (first hit in window): EXPIRE rate:shorten:{ip}:{hour} 3600
    If count > 100 → return HTTP 429
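The same fixed-window counter, modeled in-process for illustration. A real deployment issues INCR/EXPIRE against Redis so the count is shared across API servers; the class name is hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class FixedWindowLimiter {
    private final int limit;
    private final long windowMs;
    private final Map<String, long[]> counters = new HashMap<>(); // key -> {windowStart, count}

    public FixedWindowLimiter(int limit, long windowMs) {
        this.limit = limit;
        this.windowMs = windowMs;
    }

    // Returns true if the request is allowed, false if it should get HTTP 429
    public synchronized boolean allow(String key, long nowMs) {
        long[] c = counters.get(key);
        if (c == null || nowMs - c[0] >= windowMs) {
            c = new long[]{nowMs, 0};   // fresh window (EXPIRE in the Redis version)
            counters.put(key, c);
        }
        return ++c[1] <= limit;         // INCR, then compare against the limit
    }
}
```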

URL Validation

Before storing:
  1. Parse URL (valid scheme: http/https)
  2. Check against blocklist (malware domains, phishing lists)
  3. Check for redirect loops (short URL pointing to another short URL)
  4. Max long URL length: 2048 chars
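Checks 1, 3, and 4 above fit in one method; the blocklist lookup (check 2) is an external service and is omitted here. The "sho.rt" domain is the one used earlier in this module, and `UrlValidator` is a hypothetical name:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UrlValidator {
    static boolean isAcceptable(String raw) {
        if (raw == null || raw.length() > 2048) return false;    // check 4
        try {
            URI u = new URI(raw);
            String scheme = u.getScheme();
            if (scheme == null) return false;                    // check 1: must parse
            scheme = scheme.toLowerCase();
            if (!scheme.equals("http") && !scheme.equals("https")) return false;
            // Check 3: reject loops — short URLs pointing back at our own domain
            return u.getHost() == null || !u.getHost().equalsIgnoreCase("sho.rt");
        } catch (URISyntaxException e) {
            return false;
        }
    }
}
```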

High Traffic URLs (Hotkeys)

Problem: A viral tweet links to abc1234 → 100K requests/second

Solutions:
  1. In-process cache per server handles burst (no Redis roundtrip at all)
  2. If still insufficient: replicate hot key across N Redis instances
     GET url:abc1234:0, :1, :2 → round-robin reads
  3. CDN: cache the redirect response at CDN edge
     (only works for 301, not 302 — tradeoff with analytics)
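Solution 2 is just a key suffix plus round-robin. A sketch: `replicas` would be copies on N distinct Redis nodes in practice, and every write or invalidation must touch all N keys:

```java
import java.util.concurrent.atomic.AtomicLong;

public class HotKeyReader {
    private final int replicas;                     // copies of the hot key, e.g. 3
    private final AtomicLong counter = new AtomicLong();

    public HotKeyReader(int replicas) { this.replicas = replicas; }

    // url:abc1234:0, url:abc1234:1, ... — spread reads across the copies
    public String keyFor(String shortCode) {
        long n = counter.getAndIncrement() % replicas;
        return "url:" + shortCode + ":" + n;
    }
}
```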

10. Interview Framework Applied

Step 1 — Requirements (5 min): [See Section 1]
Step 2 — Estimation (5 min):   300M URLs, 100:1 read:write, 150GB total
Step 3 — HLD (10 min):         [See architecture diagram, Section 3]
Step 4 — Deep dive (15 min):   Short code generation + caching strategy
Step 5 — Bottlenecks (5 min):  Cache hit rate, hot URLs, write bottleneck (ID gen)
Step 6 — Failures (2 min):     Redis down → fallback to DB; DB down → serve cached
Step 7 — Scale (3 min):        >1 TB → shard by hash(short_code), multi-region

Key numbers to say out loud:
  "150 GB fits on one DB — no sharding needed initially"
  "99% cache hit rate — nearly all redirects never touch MySQL"
  "7-char base62 = 3.5 trillion codes — won't run out for centuries"
  "302 for analytics, 301 for CDN offload — explicit trade-off"

📝 Tasks

Task 1 — Implement Short Code Generator

Write base62 encode/decode in Java. Handle: encode(0), encode(1), encode(62), encode(62^7 - 1). Verify: decode(encode(n)) == n. Handle collision strategy for MD5 approach.

Task 2 — Schema Design

Extend the schema to support:

Task 3 — Failure Scenarios

Design the failure handling for:

  1. Redis cluster completely down
  2. MySQL primary goes down mid-write
  3. ID generator service is unavailable
  4. Analytics Kafka topic has 5-minute lag (consumer backlog)
  5. A single URL receives 1M req/sec (DDoS via viral link)

⭐ Task 4 — Full Redesign: Multi-Region

Redesign for global deployment (US + EU + APAC):

Design: DNS routing strategy, data replication approach, consistency model for cross-region writes, analytics aggregation.


✅ Module B5 Completion Checklist

