SYSTEM DESIGN MASTERY · TRACK B · MODULE B7 · WEEK 17 · REAL-TIME MESSAGING · WEBSOCKETS · E2E ENCRYPTION
Case Study No. 3 · Real-Time Messaging · Delivery Receipts

Design WhatsApp

WEBSOCKETS · CASSANDRA · PRESENCE SYSTEM · GROUP MESSAGING · MEDIA PIPELINE · S3 + CDN
At a glance: 2B users · 100B msg/day · ~1K Chat Servers
Topics: WebSocket · Cassandra · Session Store · Delivery Receipts · Presence · Group Fan-Out · S3 Media
Requirements
Establish scope — then everything follows from the constraints
Functional
1-on-1 messaging (text, media, emoji)
Group messaging (up to 1,024 members)
Delivery receipts (sent / delivered / read)
Online presence + "last seen"
Media sharing (images, video, audio)

OUT OF SCOPE: calls, disappearing msgs, payments
Non-Functional
2B users, 100M DAU
100B messages/day → 1.16M msg/sec
Group fan-out: 1 → up to 1,024 recipients
Delivery latency p99 < 500ms
Presence propagation p99 < 1 second
Availability: 99.99%
Durability: zero message loss
The core insight: 100M concurrent users, each holding a persistent WebSocket connection, at ~100K connections per server ≈ 1,000 Chat Servers. Each server is stateful — it owns those connections. Routing a message means finding the exact server the recipient is connected to. That is the central routing problem.
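To make that mapping concrete, here is a minimal sketch of the Session Store, assuming Jedis as the Redis client; the session:{userId} key format matches the routing step and the recap later on this page, while the class, method, and host names are illustrative only.

Session Store mapping (illustrative sketch) · JAVA
import redis.clients.jedis.JedisPooled;

public class SessionStore {
    private static final long TTL_SECONDS = 86_400;                 // refreshed while the socket lives
    private final JedisPooled redis = new JedisPooled("redis-host", 6379);  // host is a placeholder

    // Chat Server calls this when a user's WebSocket is accepted
    public void register(long userId, String serverAddr) {
        redis.setex("session:" + userId, TTL_SECONDS, serverAddr);  // e.g. "chat-server-C:8080"
    }

    // Router calls this to find which Chat Server owns the recipient's connection
    public String lookup(long userId) {
        return redis.get("session:" + userId);                      // null → user offline
    }

    // Chat Server calls this on a clean disconnect
    public void unregister(long userId) {
        redis.del("session:" + userId);
    }
}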
Why WebSockets?
Comparing polling vs long-polling vs WebSockets
Short Polling
CLIENT ASKS EVERY 1 SECOND
100M users × 1 req/sec = 100M req/sec. Server overloaded. Most responses are empty. 1-second worst-case latency. Unacceptable.
Long Polling
HOLD CONNECTION UNTIL MESSAGE
Better than polling. Still stateless. Proxy timeouts force reconnects. Each reconnect requires re-authentication. Reconnect storms on server restart.
WebSockets ★
PERSISTENT BIDIRECTIONAL TCP
One connection per user. Server pushes instantly. Sub-10ms delivery. Heartbeat ping/pong keeps alive through NAT. Full-duplex — both sides send simultaneously.
WebSocket lifecycle · PROTOCOL
// 1. HTTP Upgrade handshake
GET /ws HTTP/1.1
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==

// Server responds:
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=

// 2. TCP connection stays open — frames flow bidirectionally
Client → Server: {"type":"message","to":456,"content":"Hey!"}
Server → Client: {"type":"message","from":123,"content":"What's up?"}
Server → Client: {"type":"ack","msgId":7890,"status":"delivered"}

// 3. Heartbeat every 30s — keeps connection alive through NAT
Server → Client: PING
Client → Server: PONG

// 4. On disconnect: client reconnects, fetches offline messages via REST
GET /messages/offline?since={last_message_id}
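On the server side, the same lifecycle maps onto annotated endpoint callbacks. Below is a hedged sketch using the Jakarta WebSocket API; the pieces named in the comments (Session Store, inbox, Kafka topic) are covered in the following sections and are not implemented here.

Chat Server endpoint lifecycle (illustrative sketch) · JAVA
import java.io.IOException;
import java.nio.ByteBuffer;
import jakarta.websocket.CloseReason;
import jakarta.websocket.OnClose;
import jakarta.websocket.OnMessage;
import jakarta.websocket.OnOpen;
import jakarta.websocket.PongMessage;
import jakarta.websocket.Session;
import jakarta.websocket.server.ServerEndpoint;

@ServerEndpoint("/ws")
public class ChatEndpoint {

    @OnOpen
    public void onOpen(Session ws) throws IOException {
        // After the 101 upgrade: register this connection (user → this server in the Session Store),
        // then start the heartbeat so NAT/proxy table entries stay warm.
        ws.getBasicRemote().sendPing(ByteBuffer.allocate(0));
    }

    @OnMessage
    public void onText(String frame, Session ws) {
        // {"type":"message","to":456,"content":"Hey!"} → persist, ACK the sender, publish to Kafka
    }

    @OnMessage
    public void onPong(PongMessage pong, Session ws) {
        // Client answered the ping: still reachable; schedule the next ping in ~30s
    }

    @OnClose
    public void onClose(Session ws, CloseReason reason) {
        // Drop the in-memory connection, delete session:{userId}, record last_seen;
        // messages sent while offline land in the inbox and are fetched on reconnect.
    }
}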
Message Send / Receive Path
End-to-end: Alice sends "Hey!" → Bob receives in <500ms
// ALICE SENDS MESSAGE TO BOB — FULL PATH
1. [Alice] ──WS──→ [Chat Server A]                        Alice's persistent connection
2. [Server A] writes to [Cassandra: messages table]       durable; message_id = Snowflake
3. [Server A] ──WS──→ [Alice]                             ACK: message saved ✓ (single grey tick)
4. [Server A] publishes to [Kafka: "messages"]            async routing
5. [Router] GET session:{bobId} → [Redis Session Store]   returns "chat-server-C:8080"
6. [Router] HTTP POST → [Chat Server C]                   forward message to Bob's server
7. [Server C] ──WS──→ [Bob]                               message delivered ✓✓ (double grey)
8. [Bob] opens chat → sends READ receipt → Server C → Server A → Alice   ✓✓ blue
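A compact sketch of what Chat Server A does in steps 2–4, assuming a Kafka producer for routing; MessageStore, IdGenerator and WsConnection are stand-ins for application code, not library APIs.

Chat Server send path, steps 2–4 (illustrative sketch) · JAVA
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SendPathHandler {

    interface MessageStore { void save(long conversationId, long messageId, long senderId, String content); }
    interface IdGenerator  { long nextSnowflakeId(); }
    interface WsConnection { void send(String frame); }

    private final MessageStore store;
    private final IdGenerator ids;
    private final KafkaProducer<String, String> producer;

    public SendPathHandler(MessageStore store, IdGenerator ids, Properties kafkaProps) {
        this.store = store;
        this.ids = ids;
        this.producer = new KafkaProducer<>(kafkaProps);   // serializers configured by the caller
    }

    public void onClientMessage(long conversationId, long senderId, long recipientId,
                                String content, WsConnection senderSocket) {
        long messageId = ids.nextSnowflakeId();                        // time-ordered message_id
        store.save(conversationId, messageId, senderId, content);      // step 2: durable write to Cassandra

        senderSocket.send("{\"type\":\"ack\",\"msgId\":" + messageId   // step 3: single grey tick
                + ",\"status\":\"sent\"}");

        // step 4: async routing; a Router consumer looks up session:{recipientId} and forwards
        producer.send(new ProducerRecord<>("messages",
                Long.toString(recipientId),
                "{\"msgId\":" + messageId + ",\"to\":" + recipientId + "}"));
    }
}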
Offline Message Delivery
Bob is offline when message sent — then reconnects · FLOW
// Bob is offline: message stored in Cassandra (already done — Step 2 above)
// Also: store message_id in inbox:{bobId} sorted set (score = member = Snowflake id, so it stays time-ordered)
ZADD inbox:bob {message_id} {message_id}

// Bob reconnects (WebSocket upgrade):
// 1. REST call to fetch missed messages
GET /messages/offline?userId=bob&since={lastReadMessageId}

// 2. Server queries Cassandra for all conversations Bob participates in
// 3. Returns all messages with message_id > lastReadMessageId
// 4. Push to Bob's new WebSocket connection
// 5. Update last_read_message_id = latest received
// 6. Mark each message as delivered → route receipts back to senders
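One possible shape for that reconnect handler, assuming Jedis for the inbox:{userId} sorted set; MessageStore, WsConnection and ReceiptService are hypothetical helpers standing in for the Cassandra read, the new socket, and receipt routing.

Offline delivery on reconnect (illustrative sketch) · JAVA
import redis.clients.jedis.JedisPooled;

public class OfflineDelivery {

    interface MessageStore   { String loadAsJson(long messageId); }   // Cassandra read (assumed)
    interface WsConnection   { void send(String frame); }             // Bob's new socket (assumed)
    interface ReceiptService { void markDelivered(long messageId, long recipientId); }

    private final JedisPooled redis = new JedisPooled("redis-host", 6379);  // host is a placeholder
    private final MessageStore store;
    private final ReceiptService receipts;

    public OfflineDelivery(MessageStore store, ReceiptService receipts) {
        this.store = store;
        this.receipts = receipts;
    }

    public void onReconnect(long userId, long lastReadMessageId, WsConnection socket) {
        String inboxKey = "inbox:" + userId;

        // Steps 1–3: everything queued after the client's cursor (scores are Snowflake message_ids)
        for (String id : redis.zrangeByScore(inboxKey, "(" + lastReadMessageId, "+inf")) {
            long messageId = Long.parseLong(id);
            socket.send(store.loadAsJson(messageId));    // step 4: push over the new WebSocket
            receipts.markDelivered(messageId, userId);   // step 6: route the delivered receipt back
        }

        // Step 5: the cursor has advanced past everything delivered, so drain the inbox
        redis.zremrangeByScore(inboxKey, "-inf", "+inf");
    }
}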
Message Delivery Receipts
Three-state system — the detail that separates good answers from great ones
✓   SENT — single grey tick.
    Message saved to Chat Server & Cassandra. Server ACKs to sender immediately.
    Guarantee: the message will not be lost.

✓✓  DELIVERED — double grey tick.
    Message reached the recipient's device. Recipient's client ACKs via WebSocket.
    Server updates the status and notifies the sender.

✓✓  READ — double blue tick.
    Recipient opened the conversation. Client sends a READ event via WebSocket,
    routed back to the sender. Privacy: read receipts can be disabled.
Group message receipt tracking · SQL / CASSANDRA
-- Option A: counters on the message row
ALTER TABLE messages ADD delivered_count INT;
ALTER TABLE messages ADD read_count INT;      -- CQL has no DEFAULT; treat NULL as 0 in app code
-- ✓✓ shown when delivered_count = group_size
-- ✓✓ blue when read_count = group_size

-- Option B: per-recipient receipts table (more granular, scales better)
CREATE TABLE message_receipts (
    message_id    BIGINT,
    recipient_id  BIGINT,
    status        VARCHAR,  -- 'delivered' | 'read'
    updated_at    TIMESTAMP,
    PRIMARY KEY (message_id, recipient_id)
);
-- Query: SELECT COUNT(*) FROM message_receipts
--        WHERE message_id = X AND status = 'read' ALLOW FILTERING
--        (single partition for message_id; the status filter needs ALLOW FILTERING or client-side counting)
-- Compare to group size → determine if all have read
Failure case: what if the delivered ACK is lost in transit? The sender never sees ✓✓ even though the message was delivered. The recipient should re-send the ACK on its next heartbeat or reconnect. The delivered-status update is idempotent — applying it twice is harmless (see the sketch below).
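One way to see why the re-sent ACK is harmless: with the Option B table, the receipt write is a plain upsert keyed by (message_id, recipient_id), so replaying it leaves the row unchanged. A sketch with the DataStax Java driver; the "wa" keyspace name is an assumption.

Idempotent receipt upsert (illustrative sketch) · JAVA
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

public class ReceiptHandler {

    private final CqlSession session = CqlSession.builder().withKeyspace("wa").build();  // keyspace assumed

    private final PreparedStatement upsert = session.prepare(
        "INSERT INTO message_receipts (message_id, recipient_id, status, updated_at) "
      + "VALUES (?, ?, ?, toTimestamp(now()))");

    // Cassandra INSERT is an upsert: calling this twice for the same (message, recipient)
    // pair simply rewrites the same row, so duplicate ACKs are harmless.
    public void onDeliveredAck(long messageId, long recipientId) {
        session.execute(upsert.bind(messageId, recipientId, "delivered"));
        // A production version would also avoid downgrading an existing 'read' back to 'delivered'.
    }

    public void onReadReceipt(long messageId, long recipientId) {
        session.execute(upsert.bind(messageId, recipientId, "read"));
    }
}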
Presence System
100M concurrent users sending heartbeats every 30s = 3.3M writes/sec
// PRESENCE LIFECYCLE — ALICE'S SESSION
T=0:00   CONNECT      SET presence:alice "online" EX 45   |  Chat Server registers the session
T=0:30   HEARTBEAT    EXPIRE presence:alice 45            |  refreshes the TTL — stays "online"
T=1:00   HEARTBEAT    EXPIRE presence:alice 45            |  continuous 30s cadence
T=1:05   DISCONNECT   no explicit DEL                     |  last refresh was T=1:00, so the key expires 40s later
T=1:45   EXPIRED      GET presence:alice → NULL           |  "Last seen 1:05" (disconnect time recorded in the DB)
Presence read + subscriber-based push · REDIS
// Write: every 30s heartbeat via WebSocket
SETEX presence:{userId} 45 "online"   // 45s TTL > 30s heartbeat interval

// Read: check if user is online
String val = redis.get("presence:" + userId);
if (val != null) return "Online";
else return "Last seen at " + db.getLastSeen(userId);

// Scaling the writes:
// 100M active users × 1 write/30s = 3.3M writes/sec
// Redis Cluster: shard by hash(userId) across 10+ nodes
// Each node handles ~330K writes/sec → achievable

// Subscriber-based presence notifications (avoids fan-out to all contacts):
// Bob opens chat with Alice → subscribe to presence:{aliceId}
// Alice comes online → notify only active subscribers (open chat windows)
// NOT: notify all 300 of Alice's contacts (expensive, most don't care)
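A sketch of that subscriber mechanism, assuming Jedis pub/sub for cross-server broadcast and an in-memory map of open chat windows per Chat Server; WsSession and the "presence-events" channel name are illustrative. Only local watchers get a push, so Alice's other contacts cost nothing.

Subscriber-based presence notifications (illustrative sketch) · JAVA
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPubSub;

public class PresenceNotifier {

    interface WsSession { void send(String frame); }                  // assumed socket wrapper

    private final Map<Long, Set<WsSession>> watchers = new ConcurrentHashMap<>();
    private final JedisPool pool = new JedisPool("redis-host", 6379); // host is a placeholder

    // Bob opens the chat with Alice → his session starts watching Alice's presence
    public void watch(long targetUserId, WsSession subscriber) {
        watchers.computeIfAbsent(targetUserId, id -> ConcurrentHashMap.newKeySet()).add(subscriber);
    }

    public void unwatch(long targetUserId, WsSession subscriber) {
        Set<WsSession> set = watchers.get(targetUserId);
        if (set != null) set.remove(subscriber);
    }

    // Called on Alice's Chat Server when it sees her connect or disconnect
    public void publishChange(long userId, String state) {            // state: "online" | "offline"
        try (Jedis jedis = pool.getResource()) {
            jedis.publish("presence-events", userId + ":" + state);
        }
    }

    // Each Chat Server runs this once on a background thread; only local watchers get a push
    public void listen() {
        try (Jedis jedis = pool.getResource()) {
            jedis.subscribe(new JedisPubSub() {
                @Override public void onMessage(String channel, String payload) {
                    String[] parts = payload.split(":");
                    Set<WsSession> subs = watchers.get(Long.parseLong(parts[0]));
                    if (subs == null) return;                          // nobody here has that chat open
                    for (WsSession s : subs) {
                        s.send("{\"type\":\"presence\",\"userId\":" + parts[0]
                             + ",\"state\":\"" + parts[1] + "\"}");
                    }
                }
            }, "presence-events");
        }
    }
}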
Group Messaging
1 sender → up to 1,024 recipients — fan-out at delivery time
Store Once, Route Many
Group message stored ONE TIME in Cassandra (conversation_id = group_id). Each member's inbox stores only the message_id (pointer). No data duplication.
messages table: 1 row
group_member inboxes: 1,024 message_id pointers
Storage: 1× message + 1,024× 8-byte pointers
Fan-Out at Delivery
Fan-out service (Kafka consumer) looks up online group members, finds their Chat Servers via Session Store, routes message to each. Offline members get message in inbox on reconnect.
Online (500 of 1,024): WS push immediately
Offline (524 of 1,024): inbox entry → REST on reconnect
Throughput: 1,000 active groups × 1 msg/sec each × ~512 online members ≈ 512K pushes/sec
Group message fan-out service · JAVA
public void handleGroupMessage(GroupMessageEvent e) {
    // 1. Message already stored in Cassandra by Chat Server
    UUID groupId = e.groupId;
    long messageId = e.messageId;

    // 2. Fetch all group members (paginated from Cassandra/Redis)
    List<Long> members = groupService.getMembers(groupId);

    for (long memberId : members) {
        if (memberId == e.senderId) continue;  // skip sender

        String serverAddr = redis.get("session:" + memberId);

        if (serverAddr != null) {
            // Online: route to their Chat Server
            chatRouter.deliver(serverAddr, memberId, messageId);
        } else {
            // Offline: store in inbox for later delivery
            redis.zadd("inbox:" + memberId, messageId, Long.toString(messageId));  // score = member = message_id
        }
    }
}
Data Models
Messages (Cassandra), Social Graph, Sessions (Redis)
TABLE: messages — Cassandra · PRIMARY KEY (conversation_id, message_id DESC)
COLUMN            TYPE      NOTES
conversation_id   UUID      Partition key — all messages in a chat live on the same node
message_id        BIGINT    Clustering key DESC — Snowflake ID, newest first, embeds timestamp
sender_id         BIGINT    Who sent it
content           TEXT      Encrypted text (E2E: only decryptable on device)
media_url         TEXT      S3/CDN URL, NULL for text messages
message_type      VARCHAR   'text' | 'image' | 'video' | 'audio' | 'document'
status            VARCHAR   'sent' | 'delivered' | 'read' — updated by receipt events
QUERY "last 50 msgs": SELECT * FROM messages WHERE conversation_id = X LIMIT 50 → single partition ✓
ORDERING: message_id DESC → newest first, no sort needed ✓
PAGINATION: WHERE message_id < {cursor} LIMIT 50 → keyset pagination ✓
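Those access patterns translate almost directly into prepared statements. A sketch with the DataStax Java driver; the "wa" keyspace name is an assumption, and both queries read a single partition.

Chat history read + keyset pagination (illustrative sketch) · JAVA
import java.util.UUID;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.ResultSet;

public class ChatHistoryDao {

    private final CqlSession session = CqlSession.builder().withKeyspace("wa").build();  // keyspace assumed

    private final PreparedStatement firstPage = session.prepare(
        "SELECT * FROM messages WHERE conversation_id = ? LIMIT 50");

    private final PreparedStatement nextPage = session.prepare(
        "SELECT * FROM messages WHERE conversation_id = ? AND message_id < ? LIMIT 50");

    // Newest 50 messages; clustering order DESC means no sort is needed
    public ResultSet latest(UUID conversationId) {
        return session.execute(firstPage.bind(conversationId));
    }

    // Keyset pagination: pass the smallest message_id of the previous page as the cursor
    public ResultSet olderThan(UUID conversationId, long cursorMessageId) {
        return session.execute(nextPage.bind(conversationId, cursorMessageId));
    }
}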
Media Upload Protocol
Client uploads directly to S3 — Chat Server never touches media bytes · FLOW
// Step 1: Client requests pre-signed upload URL
POST /media/upload/presign
Response: {uploadUrl: "https://s3.../media/uuid.jpg?X-Amz-Signature=...", mediaId: "uuid"}

// Step 2: Client uploads directly to S3 (NOT through Chat Server)
PUT https://s3.amazonaws.com/wa-media/uuid.jpg
Content-Type: image/jpeg
Body: [encrypted image bytes]

// Step 3: Client sends message with media reference
WS: {type:"message", to:456, mediaId:"uuid", mediaType:"image", thumbnail:"base64..."}

// Step 4: Recipient downloads from CDN (not Chat Server)
GET https://cdn.wa.me/media/uuid.jpg  ← edge-cached, fast

// Benefits:
// Chat servers handle only ~200 byte WS frames (never MB of media)
// S3 + CDN handle bandwidth independently
// E2E encryption: client encrypts before upload, only recipient can decrypt
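Step 1 above is typically a few lines with the AWS SDK v2 presigner. A hedged sketch of the presign endpoint's core; the "wa-media" bucket follows the example URLs above, and the HTTP wiring around it is omitted.

Pre-signed upload URL generation (illustrative sketch) · JAVA
import java.net.URL;
import java.time.Duration;
import java.util.UUID;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.presigner.S3Presigner;
import software.amazon.awssdk.services.s3.presigner.model.PutObjectPresignRequest;

public class MediaPresigner {

    private final S3Presigner presigner = S3Presigner.create();   // default credentials/region

    public PresignResult presignUpload(String contentType) {
        String mediaId = UUID.randomUUID().toString();

        PutObjectRequest put = PutObjectRequest.builder()
                .bucket("wa-media")                                // bucket name from the example above
                .key("media/" + mediaId)
                .contentType(contentType)
                .build();

        PutObjectPresignRequest presignRequest = PutObjectPresignRequest.builder()
                .signatureDuration(Duration.ofMinutes(10))         // short-lived: client must upload soon
                .putObjectRequest(put)
                .build();

        URL uploadUrl = presigner.presignPutObject(presignRequest).url();
        return new PresignResult(mediaId, uploadUrl.toString());
    }

    public record PresignResult(String mediaId, String uploadUrl) { }
}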
Scale & Estimation
Numbers that anchor every architectural decision
COMPONENT                          VALUE           CALCULATION
Chat Servers                       ~1,000          100M concurrent users ÷ 100K WS connections per server
Message throughput                 1.16M msg/sec   100B msg/day ÷ 86,400 sec
Text storage/day                   10 TB/day       100B msg/day × ~100 bytes
With Cassandra replication (3×)    30 TB/day       10 TB × 3 replicas
Cassandra nodes (5 yr)             ~500 nodes      30 TB/day × 365 × 5 ≈ 55 PB ÷ 100 TB per node
Session Store (Redis)              5 GB            100M sessions × 50 bytes = 5 GB — fits on one node
Presence writes/sec                3.3M/sec        100M users ÷ 30s heartbeat interval
Presence Redis nodes               10+ nodes       3.3M ops/sec ÷ ~300K ops per node
Kafka throughput                   ~1.2 GB/sec     1.16M msg/sec × ~1 KB avg (≈3.5 GB/sec with 3× replication)
Key numbers to say aloud: "1,000 Chat Servers for 100M concurrent WebSocket connections." · "Cassandra (conversation_id, message_id DESC) — single partition read for chat history." · "Presence writes at 3.3M/sec require a Redis Cluster, not a single node." · "Media never touches Chat Servers — S3 pre-signed URL + CDN."
01
WebSocket Connection Management
~1.5 hrs
  1. Alice opens WhatsApp. How does the app choose which Chat Server to connect to? (hint: load balancer with sticky sessions? consistent hashing?)
  2. Chat Server 47 crashes. 100K users lose their connections. What happens step-by-step? How long until they're reconnected?
  3. Bob's phone loses network for 60 seconds. What is queued where? Walk through the exact delivery sequence when he reconnects.
  4. WhatsApp Web: same account open on phone AND laptop. Design the multi-device connection model. How does a message reach both devices?
02
Delivery Receipt State Machine
~1 hr
  1. Draw the state machine for message status: valid states, valid transitions, triggering events
  2. Group of 500 members: when exactly do ✓✓ (delivered) and ✓✓ blue (read) show? All 500? First? Majority?
  3. Failure: message delivered to Bob, but "delivered" ACK lost in transit. How does the system eventually become consistent?
  4. Alice is offline when Bob's read receipt arrives. Where is it stored? When does Alice see the blue tick?
03
Presence System at 100M Users
~1.5 hrs

Design a presence system with these constraints:

  • Presence heartbeat every 30s from each online user
  • "Last seen" accurate to within 1 minute
  • Privacy: some users hide last seen entirely
  • Must handle 3.3M presence writes/sec
  • When Alice comes online, notify Bob (who has a chat open with Alice)

Design the Redis schema, write path, read path, and subscriber notification mechanism.

04
Full WhatsApp Design — 45-min Simulation
~3 hrs

Apply all 7 framework steps. Time to 45 minutes:

  1. Requirements + estimations: Chat Server count, Cassandra nodes, Redis cluster, Kafka throughput
  2. Full architecture diagram: all components, data flows
  3. Deep dive: WebSocket routing (Session Store lookup, Chat Server statefulness)
  4. Deep dive: group messaging fan-out at 1,024 members
  5. Deep dive: presence at 100M users (heartbeat + Redis TTL + subscriber push)
  6. Failure modes: Chat Server crash, Redis down, Cassandra shard failure
  7. Media pipeline: S3 pre-signed URL → CDN → E2E encryption mention
MODULE B7 · WHATSAPP
WebSockets vs polling: why WS, lifecycle, heartbeat, reconnect
Chat Server statefulness: owns connections, Session Store maps user→server
Full send path: Alice → Server A → Cassandra → Kafka → Router → Server C → Bob
Session Store: SET session:{userId} serverAddr EX 86400 on connect
Cassandra schema: (conversation_id, message_id DESC) — single partition chat history
Offline delivery: inbox sorted set + REST fetch on reconnect with cursor
3-state receipts: ✓ sent, ✓✓ delivered, ✓✓ blue read — state machine
Group receipts: message_receipts table with per-recipient status
Presence: SETEX 45s TTL + heartbeat 30s + subscribe-based notifications
Presence scale: 3.3M writes/sec → Redis Cluster (10+ nodes)
Group messaging: store once, route many, inbox for offline members
Media: S3 pre-signed URL (client uploads directly) + CDN delivery
Scale numbers: 1K Chat Servers, 30 TB/day, 3.3M presence writes/sec
✏️ Tasks 1–3: WS management, receipt state machine, presence design
✏️ Task 4 (capstone): full WhatsApp — 45-min interview simulation
// NEXT MODULE
B8 — Design YouTube
Video upload pipeline · Transcoding (HLS adaptive bitrate)
CDN video delivery · View counter · Recommendation engine overview
Search indexing · Comment system · Storage at petabyte scale