Module B7 — Design WhatsApp (Real-Time Messaging)

System Design Mastery Course | Track B: HLD | Week 17


🎯 Module Overview

Duration: 1 Week | Track: B — HLD Case Studies | Prerequisites: B1–B6

Goal: WhatsApp is the canonical real-time messaging question. It tests WebSockets, message delivery guarantees, presence systems, group-message fan-out, and storage at scale, and it appears in every FAANG interview loop.


1. Requirements

Functional

Core:
  1. 1-on-1 messaging — send/receive text, media, emoji
  2. Group messaging — up to 1,024 members
  3. Message delivery receipts — sent ✓, delivered ✓✓, read ✓✓ (blue)
  4. Online presence — "Online" / "Last seen at 3:42 PM"
  5. Media sharing — images, video, audio, documents

Out of scope: calls, disappearing messages, payments, status stories

Non-Functional

Scale:
  2 billion users, 100M DAU
  100 billion messages/day → ~1.16M messages/sec
  Average message: 100 bytes (text) to 10 MB (video)
  Group message fan-out: 1 sender → up to 1,024 recipients

Performance:
  Message delivery latency: p99 < 500ms (sender to recipient)
  Presence update propagation: p99 < 1 second

Availability: 99.99%
Message durability: zero message loss
Ordering: messages in a conversation must be in order

2. The Core Protocol: WebSockets

Why Not HTTP Polling?

Short polling (every 1s):
  100M users × 1 req/sec = 100M req/sec → server overloaded
  Wasteful: most polls return empty response
  Latency: up to 1s delay

Long polling:
  Better: client holds connection open, server responds when message arrives
  Problems: connection drops, reconnect storms, proxy timeouts
  Still stateless — server must find messages for each reconnect

WebSockets (chosen):
  Persistent TCP connection between client and server
  Bidirectional: server pushes messages to client instantly
  Full-duplex: client and server send simultaneously
  Latency: milliseconds (no polling delay)
  Efficient: one connection per user (not per message)

WebSocket Lifecycle

1. Client sends HTTP UPGRADE request
2. Server responds 101 Switching Protocols
3. TCP connection remains open — WebSocket frames flow bidirectionally
4. Heartbeat (ping/pong every 30s) keeps connection alive through NAT
5. On disconnect: client reconnects, fetches offline messages via REST
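
Step 5 hides a subtlety: when a server dies, its ~100K clients must not all reconnect in the same instant (a reconnect storm). A common client-side policy is exponential backoff with full jitter — a minimal sketch, where the base and cap values are assumptions:

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter for WebSocket reconnects.

    Each failed attempt doubles the window, capped at `cap` seconds;
    the uniform draw spreads reconnects out in time so a crashed
    Chat Server's clients don't all hammer the cluster at once.
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))
```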

3. High-Level Architecture

SEND MESSAGE PATH:
[Alice's Phone]
    │ WebSocket frame: {to: Bob, content: "Hey!"}
    ↓
[Chat Server A]  ← Alice is connected here
    ├── Write message to Cassandra (durable, message_id = Snowflake)
    ├── Publish to Kafka topic: "messages" (async fanout)
    └── Return ACK to Alice: message received by server ✓

[Message Routing Service] (Kafka consumer)
    ├── Look up: which Chat Server is Bob connected to? (Session Store → Redis)
    │     → Bob is connected to Chat Server C
    └── Route message to Chat Server C

[Chat Server C]
    └── Push message to Bob via Bob's WebSocket connection ✓✓ (delivered)

[Bob reads message]
    └── Bob's client sends "read receipt" back via WebSocket
        → Chat Server C routes to Chat Server A → pushed to Alice ✓✓ (blue)

OFFLINE USER PATH:
[Bob is offline]
    Message stored in Cassandra "inbox" for Bob
    When Bob reconnects: REST API fetches all offline messages since last_seen_msg_id
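
The send path above can be condensed into an in-memory sketch. Plain dicts stand in for Cassandra, Kafka, and the Redis session store; all names here are invented for illustration:

```python
# In-memory sketch of the send path. Dicts stand in for Cassandra,
# Kafka, and the Redis session store; all names are invented.

message_store = {}   # (conversation_id, message_id) -> message row
session_store = {}   # user_id -> chat_server_id
server_queues = {}   # chat_server_id -> frames waiting to be pushed

def send_message(conversation_id, message_id, sender, recipient, content):
    # 1. Durable write first: the ACK to the sender means "saved",
    #    not "delivered" -- that's the single gray check.
    message_store[(conversation_id, message_id)] = {
        "sender": sender, "content": content, "status": "sent",
    }
    # 2. Route: which Chat Server holds the recipient's WebSocket?
    server = session_store.get(recipient)
    if server is None:
        return "stored_offline"   # picked up on next reconnect
    # 3. Hand the frame to that server's push queue (Kafka in production).
    server_queues.setdefault(server, []).append((recipient, message_id))
    return "routed"
```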

4. Key Components

Chat Server (Stateful)

Each Chat Server maintains WebSocket connections for N users.
The server IS stateful — knows which users are connected locally.

At scale: a naive budget of 65K connections/server gives 100M ÷ 65K ≈ 1,500 servers.
But modern event-loop servers (Node.js/Netty) hold 100K+ concurrent WS connections each, so:

100M ÷ 100K = ~1,000 Chat Servers

Challenge: message must reach the EXACT server Alice is connected to.
Solution: Session Store (Redis) maps user_id → chat_server_id

Session Store (Redis)

On user connect:    SET session:{userId} serverIP EX 86400
On user disconnect: DEL session:{userId}
On heartbeat:       EXPIRE session:{userId} 86400  (refresh TTL)

Routing lookup:
  GET session:{bobId} → "chat-server-47:8080"
  Route message to that server via internal HTTP or message queue

Message Store (Cassandra)

Why Cassandra:
  ✅ High write throughput (1.16M msg/sec — Cassandra's strength)
  ✅ Partition by conversation_id → all messages in conversation on same node
  ✅ Clustering by message_id DESC → newest first, efficient pagination
  ✅ Multi-datacenter replication built-in

Schema:
CREATE TABLE messages (
    conversation_id UUID,
    message_id      BIGINT,   -- Snowflake ID (embeds timestamp)
    sender_id       BIGINT,
    content         TEXT,
    media_url       TEXT,     -- NULL if text-only
    message_type    VARCHAR,  -- 'text', 'image', 'video', 'audio'
    status          VARCHAR,  -- 'sent', 'delivered', 'read'
    created_at      TIMESTAMP,
    PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

Read: SELECT * FROM messages WHERE conversation_id = X LIMIT 50
      → single partition, newest first ✓
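
Pagination falls out of the clustering order: the next (older) page is keyed by the oldest message_id the client already holds. A sketch that builds the parameterized CQL; the helper name and `%s` placeholder style are assumptions:

```python
def page_query(conversation_id, before_id=None, limit=50):
    """Build the paginated read sketched above: newest page first,
    older pages selected by a clustering-column range on message_id."""
    cql = "SELECT * FROM messages WHERE conversation_id = %s"
    params = [conversation_id]
    if before_id is not None:
        cql += " AND message_id < %s"   # range scan within one partition
        params.append(before_id)
    cql += " LIMIT %s"
    params.append(limit)
    return cql, params
```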

Inbox (Offline Messages)

When Bob is offline:
  Message is stored in messages table (durable, already done)
  Also stored in the inbox:{bobId} sorted set: ZADD inbox:{bobId} {msgId} {msgId}
  (score = member = Snowflake message_id, so the set stays in time order)

On Bob reconnects:
  1. Fetch last_read_message_id from Bob's profile
  2. REST call: GET /messages?since={last_read_message_id}
  3. Server queries each conversation: SELECT * FROM messages WHERE conversation_id = C AND message_id > last_read
     (one query per partition — a multi-partition IN combined with a clustering range is a Cassandra anti-pattern)
  4. Push all missed messages to Bob
  5. Update last_read_message_id = latest received
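
Steps 2–5 reduce to a checkpoint-based range fetch. A minimal sketch over a single conversation, with a plain list standing in for the Cassandra query:

```python
def fetch_missed(messages, last_read_id):
    """Return messages newer than the client's checkpoint, oldest
    first, plus the new checkpoint to persist (step 5).

    `messages` is a list of (message_id, payload) tuples for one
    conversation; in production this is the Cassandra range query.
    """
    missed = sorted(m for m in messages if m[0] > last_read_id)
    new_checkpoint = missed[-1][0] if missed else last_read_id
    return missed, new_checkpoint
```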

5. Message Delivery Receipts

Three-state receipt system:
  ✓  (single gray)  = message saved to server
  ✓✓ (double gray)  = message delivered to recipient's device
  ✓✓ (double blue)  = message read by recipient

Implementation:

1. Sender → Server: message sent → Server ACKs → sender shows ✓
2. Server → Recipient online:
     WebSocket push → recipient's device ACKs → server marks delivered → sender gets ✓✓
3. Recipient opens conversation:
     Client sends READ receipt → server routes to sender → shows ✓✓ blue

Storage:
   UPDATE messages SET status = 'delivered' WHERE conversation_id = C AND message_id = X
   (Cassandra requires the full primary key in the WHERE clause)
   (Or: separate receipts table for read-scaling)

Group messages:
   ✓✓ shown when ALL members delivered (not just one)
   Blue ✓✓ shown when ALL members read
   Implementation: delivery_count and read_count columns per group message
   Alternatively: separate message_receipts table (sender_id, message_id, recipient_id, status, timestamp)
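
The ALL-members aggregation above can be sketched as a small state holder; this assumes the per-recipient-receipt model, and the class and method names are invented:

```python
class GroupReceipts:
    """Aggregate per-recipient receipts into the group-level checkmarks:
    double-gray only when ALL members have the message, blue only when
    ALL members have read it."""

    def __init__(self, member_count):
        self.member_count = member_count
        self.delivered = set()
        self.read = set()

    def on_receipt(self, user_id, kind):
        if kind == "read":
            self.read.add(user_id)
            self.delivered.add(user_id)   # read implies delivered
        else:
            self.delivered.add(user_id)

    def status(self):
        if len(self.read) == self.member_count:
            return "read"        # blue double check
        if len(self.delivered) == self.member_count:
            return "delivered"   # gray double check
        return "sent"            # single check
```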

6. Presence System

"Online" / "Last seen at 3:42 PM"

Challenge: 100M active users — each updating presence every 30s = 3.3M updates/sec

Solution: heartbeat + Redis
  Every 30s: client sends heartbeat via WebSocket
  Server updates: SET presence:{userId} "online" EX 45
  TTL slightly longer than heartbeat → expires if heartbeat stops

Read presence:
  GET presence:{userId}
  → value exists: "Online"
  → missing (expired): "Last seen at {last_heartbeat_time}"
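
The heartbeat/TTL pattern above is easy to get subtly wrong, so here is a testable in-memory stand-in for the Redis commands (the clock is injected rather than read from real time):

```python
class PresenceStore:
    """In-memory stand-in for SET presence:{userId} "online" EX 45.
    `now` is passed in so the TTL logic is testable without sleeping."""

    TTL = 45  # seconds -- slightly longer than the 30s heartbeat

    def __init__(self):
        self.expiry = {}   # user_id -> expiry timestamp (seconds)

    def heartbeat(self, user_id, now):
        self.expiry[user_id] = now + self.TTL

    def is_online(self, user_id, now):
        # Missing or expired entry -> "Last seen at ..." path.
        return self.expiry.get(user_id, 0) > now
```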

Scaling the presence writes:
  3.3M SETEX/sec is heavy for a single Redis cluster
  Solution: shard presence by userId hash across N Redis clusters
  Or: use Redis Cluster, which shards keys across 16,384 fixed hash slots (CRC16 of the key mod 16384)

Privacy:
  Users can disable "last seen" → store preference, return NULL regardless
  Read receipts can also be disabled (WhatsApp privacy settings)

Fanout of presence to contacts:
  When Alice comes online → notify all of Alice's contacts who are currently online
  Expensive: Alice has 300 contacts → 300 WebSocket pushes
  In practice: subscribe-based presence
    Bob subscribes to Alice's presence only when Bob opens a chat with Alice
    Presence only pushed to subscribers (not all contacts)
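
A minimal sketch of that subscription table, with invented names; the Chat Servers would consume the returned pushes:

```python
class PresenceSubscriptions:
    """Subscribe-based presence fan-out: pushes go only to users
    currently viewing a chat with the subject, not to all contacts."""

    def __init__(self):
        self.subs = {}   # watched_user -> set of watchers

    def open_chat(self, watcher, watched):
        self.subs.setdefault(watched, set()).add(watcher)

    def close_chat(self, watcher, watched):
        self.subs.get(watched, set()).discard(watcher)

    def on_presence_change(self, user, state):
        # Returns the (watcher, state) pushes the Chat Servers must send.
        return [(w, state) for w in self.subs.get(user, ())]
```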

7. Group Messaging

Groups: up to 1,024 members
Assume ~6K group messages/sec; at the 1,024-member cap that is up to ~6.1M WS pushes/sec (worst case, every group at the cap)

Architecture:
  Option A: Fan-out at write time
    For each group message: push to all N member WebSocket connections
    At a 1,024-member cap, fan-out is bounded and acceptable (unlike the unbounded celebrity-tweet fan-out problem)
    Implementation: maintain group_members table → look up all connections → push

  Option B: Group message queue per user (WhatsApp's approach)
    Group message stored once in messages table (conversation_id = group_id)
    Each member's inbox just stores a pointer (message_id reference)
    On member request: fetch full message from messages table

Storage model:
CREATE TABLE group_members (
    group_id  UUID,
    user_id   BIGINT,
    role      VARCHAR,   -- 'admin', 'member'
    joined_at TIMESTAMP,
    PRIMARY KEY (group_id, user_id)
);

Fan-out service:
  On group message received:
    1. Store message once in messages (conversation_id = group_id)
    2. Look up all online group members from group_members table
    3. Route via Session Store to correct Chat Servers
    4. Push message to online members' WebSocket connections
    5. Store in offline inbox for offline members
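
Steps 2–5 can be sketched as a pure routing function; `sessions` stands in for the Redis session store and all names are invented:

```python
def fan_out(group_members, sessions):
    """Split group members into online pushes (grouped by Chat Server,
    so each server gets one internal call) and offline inbox writes."""
    pushes, offline = {}, []
    for member in group_members:
        server = sessions.get(member)
        if server is None:
            offline.append(member)                        # step 5
        else:
            pushes.setdefault(server, []).append(member)  # steps 3-4
    return pushes, offline
```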

8. Media Sharing

Challenge: Large files (videos up to 100MB) should not go through chat servers.
           Chat servers handle tiny text messages, not video blobs.

Protocol:
  1. Sender selects media → client uploads DIRECTLY to S3 via pre-signed URL
     Client calls: POST /media/upload → server returns S3 pre-signed PUT URL
     Client uploads: PUT https://s3.../media/{uuid}.mp4 (directly to S3)
     Media service confirms receipt

  2. Media service:
     - Stores original in S3
     - Triggers thumbnail generation (image) or transcoding (video)
     - Stores processed variants in S3
     - CDN caches frequently accessed media

  3. Message contains media_url, not the binary data
     { type: "image", media_url: "https://cdn.wa.me/media/{uuid}.jpg", thumbnail: "..." }

  4. Recipient downloads media directly from CDN — chat server not involved

End-to-end encryption note:
  WhatsApp uses Signal Protocol: media encrypted on client before upload
  Server stores encrypted blob — cannot decrypt content
  Key exchange uses Extended Triple Diffie-Hellman (X3DH), bootstrapped via WhatsApp's key server
  (Out of scope for HLD interview — mention it, don't deep-dive)

9. Message Ordering

Challenge: messages must arrive in order within a conversation.

Approach: Snowflake ID as message_id
  Snowflake: [41-bit timestamp][10-bit machine][12-bit sequence]
  Generated at Chat Server when message received
  Globally unique + monotonically increasing per millisecond per server

Ordering guarantee:
  Messages stored in Cassandra: CLUSTERING ORDER BY message_id DESC
  Clients render messages sorted by message_id → timestamp order

Edge case: two messages sent within same millisecond
  Snowflake sequence counter handles this: up to 4096 per ms per machine
  If same machine, sequence guaranteed ordered
  If different machines: timestamp + machine_id → deterministic tie-break

Out-of-order delivery (network reordering):
  Client buffers and re-sorts by message_id before rendering
  Server-assigned Snowflake ID is canonical order — not client-assigned time
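
A minimal generator matching the layout above. The custom epoch and locking details are assumptions, and a production version must also handle sequence exhaustion (wait for the next millisecond) and clock skew:

```python
import threading
import time

class Snowflake:
    """64-bit ID: [41-bit ms timestamp][10-bit machine][12-bit sequence]."""

    EPOCH = 1_288_834_974_657  # custom epoch in ms (Twitter's, as an example)

    def __init__(self, machine_id):
        assert 0 <= machine_id < 1024      # must fit in 10 bits
        self.machine_id = machine_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self, now_ms=None):
        with self.lock:
            ms = int(time.time() * 1000) if now_ms is None else now_ms
            if ms == self.last_ms:
                # Same millisecond: bump the 12-bit sequence (up to 4096/ms).
                # A real implementation blocks until the next ms on wraparound.
                self.seq = (self.seq + 1) & 0xFFF
            else:
                self.seq = 0
                self.last_ms = ms
            return ((ms - self.EPOCH) << 22) | (self.machine_id << 12) | self.seq
```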

10. Estimating Scale

Storage:
  100B messages/day × 100 bytes avg = 10 TB/day (text only)
  With 3× replication (Cassandra): 30 TB/day
  After 5 years: ~55 PB — a Cassandra cluster of ~550 nodes at ~100 TB/node

Media:
  Assume 20% of messages include media, avg 500 KB
  100B × 20% × 500 KB = 10 PB/day (too large without TTL)
  WhatsApp in practice: media deleted from servers after download
                        (stored on device, not cloud — unlike iCloud)
  With TTL (30 days): rolling 300 PB of media on S3

WebSocket servers:
  100M concurrent users ÷ 100K connections/server = 1,000 Chat Servers
  Each server: 8 core, 64 GB RAM, persistent socket connections (Netty/Vert.x)

Redis (Session + Presence):
  100M active sessions × 50 bytes/entry = 5 GB — fits one Redis node
  100M presence entries × 50 bytes = 5 GB — fits one Redis node
  But 3.3M writes/sec for presence → Redis Cluster (10+ nodes)

Kafka (message routing):
  1.16M messages/sec × 1 KB avg = ~1.16 GB/sec
  With 3× replication: 3.5 GB/sec → 35+ Kafka nodes
  Partitions: hash(conversation_id) → ordering within conversation guaranteed
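
The arithmetic above can be checked mechanically:

```python
# Back-of-envelope checks for the estimates in this section.
messages_per_day = 100e9

# Throughput: 100B/day over 86,400 seconds.
msg_per_sec = messages_per_day / 86_400
assert round(msg_per_sec / 1e6, 2) == 1.16     # ~1.16M msg/sec

# Text storage at 100 bytes/message, then 3x Cassandra replication.
text_tb_per_day = messages_per_day * 100 / 1e12
assert text_tb_per_day == 10                   # 10 TB/day raw
assert text_tb_per_day * 3 == 30               # 30 TB/day replicated

# Media: 20% of messages, 500 KB average, 30-day rolling TTL.
media_pb_per_day = messages_per_day * 0.20 * 500e3 / 1e15
assert media_pb_per_day == 10                  # 10 PB/day
assert media_pb_per_day * 30 == 300            # 300 PB rolling window
```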

📝 Tasks

Task 1 — WebSocket Connection Management

Design the connection management system:

  1. Alice opens WhatsApp. How does the app choose which Chat Server to connect to?
  2. Chat Server 47 crashes. 100K users lose their connections. What happens?
  3. Bob’s phone loses network for 60 seconds. What messages are queued? How are they delivered when reconnected?
  4. A user has WhatsApp open on phone AND web (2 devices). How does the system manage 2 WebSocket connections for one user?

Task 2 — Delivery Receipt Implementation

Design the full 3-state delivery receipt system:

  1. Write the state machine for message status (sent → delivered → read)
  2. For a group of 500 members: when do ✓✓ (delivered) and blue ✓✓ (read) show?
  3. Handle the failure case: message delivered, but “delivered” receipt lost in transit
  4. What if the sender is offline when the receipt arrives?

Task 3 — Presence at Scale

Design a presence system for 100M concurrent users. Address: heartbeat cadence and TTL choice, sharding ~3.3M presence writes/sec, "last seen" privacy settings, and subscription-based fan-out to online contacts.

⭐ Task 4 — Full WhatsApp Design

End-to-end 45-minute interview simulation. Cover: requirements, estimations, architecture, WebSocket management, message routing, delivery receipts, group messaging fan-out, presence, media sharing. Estimate: Chat Server count, Cassandra nodes, Redis cluster size, Kafka throughput.

