SYSTEM DESIGN MASTERY · TRACK B · MODULE B4 · WEEK 14 HIGH-LEVEL DESIGN · MESSAGE QUEUES
// ASYNC MESSAGING · EVENT STREAMING · DECOUPLING

MSG
QUEUES

KAFKA · RABBITMQ · DELIVERY SEMANTICS
CONSUMER GROUPS · PARTITIONS · EXACTLY-ONCE
3 SEMANTICS · REPLAY · 16K PARTITIONS MAX
Why Message Queues?
The core problems a message queue solves — in every distributed system
Core value proposition · CONCEPT
// WITHOUT message queue: tight coupling, brittle
[Order Service] ──sync──→ [Inventory Service]  // What if inventory is down?
[Order Service] ──sync──→ [Email Service]      // What if email is slow (2s)?
[Order Service] ──sync──→ [Analytics Service]  // User waits for all 3!

// WITH message queue: decoupled, resilient, fast
[Order Service] ──→ [Topic: order-placed] ──→ [Inventory]  ← independent
                                          ──→ [Email]      ← independent
                                          ──→ [Analytics]  ← independent

Order Service returns in <1ms. Downstream services process asynchronously.
If email is down → messages queue up → processed when it recovers.
Each consumer processes at its own rate. No cascading failures.
DECOUPLING
Producer doesn't know consumers exist. Add/remove consumers without touching producer. Services evolve independently.
BUFFERING
Black Friday spike: 100K orders/sec hits queue. Consumer processes at steady 10K/sec. Queue absorbs the burst — no DB overload.
REPLAY
Kafka retains messages for days/years. New service deployed? Read from offset 0 — replay all historical events. Invaluable for debugging.
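The buffering claim above can be simulated with a toy in-memory queue. This is a minimal sketch, not a broker: a burst is enqueued instantly (the producer returns at once), then drained at the consumer's steady rate, so nothing is dropped, only delayed. `BufferDemo` and its numbers are illustrative, not from any Kafka API.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of broker-side buffering during a traffic spike:
// a 100K-message burst arrives at once; the consumer drains 10K per tick.
// The queue absorbs the burst; processing is delayed, never lost.
public class BufferDemo {
    public static int ticksToDrain(int burstSize, int consumeRatePerTick) {
        Deque<Integer> queue = new ArrayDeque<>();
        for (int i = 0; i < burstSize; i++) queue.add(i); // producer "returns" immediately
        int ticks = 0;
        while (!queue.isEmpty()) {
            // consumer drains at its own fixed rate each tick
            for (int i = 0; i < consumeRatePerTick && !queue.isEmpty(); i++) queue.poll();
            ticks++;
        }
        return ticks;
    }

    public static void main(String[] args) {
        // 100K burst, steady 10K/tick consumer: drained in 10 ticks, DB never sees the spike
        System.out.println(ticksToDrain(100_000, 10_000)); // → 10
    }
}
```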
Kafka Architecture
Topic → Partitions → Offsets → Consumer Groups — the mental model
// TOPIC: "order-placed" — 3 PARTITIONS, REPLICATION FACTOR 3
PARTITION 0 — Broker 1 (leader)
off:0 order#101
off:1 order#104
off:2 order#107
PARTITION 1 — Broker 2 (leader)
off:0 order#102
off:1 order#105
off:2 order#108
PARTITION 2 — Broker 3 (leader)
off:0 order#103
off:1 order#106
off:2 order#109
// THREE INDEPENDENT CONSUMER GROUPS — all receive all messages
GROUP A: inventory-service
Consumer A1 → P0
Consumer A2 → P1
Consumer A3 → P2
3 consumers = 3 partitions ✓
GROUP B: email-service
Consumer B1 → P0, P1, P2

1 consumer handles all — slower, but independent
GROUP C: analytics-service
Consumer C1 → P0
Consumer C2 → P1
Consumer C3 → P2
fully parallel ✓
The key insight: Each consumer group maintains its own offset per partition. Group A being at offset 50 has zero effect on Group B at offset 12. They're completely independent readers of the same durable log.
Replication & Durability
ISR and acks configuration · KAFKA CONFIG
// replication.factor=3 → 1 leader + 2 ISR replicas per partition
// ISR = In-Sync Replicas (have replicated all leader messages)

acks=0:  Producer doesn't wait for ACK. Fastest, no durability guarantee.
acks=1:  Leader ACKs after writing. Fast, but replica may not have it yet.
acks=all: All ISR ACK before the producer gets confirmation. Strongest guarantee.

// With acks=all + min.insync.replicas=2 + replication.factor=3:
// → Can lose 1 broker with ZERO data loss
// → Broker 1 (leader) + Broker 2 (replica) both have the message before ACK
// → Broker 1 dies → Broker 2 becomes leader → no data lost

// Partition key routing:
producer.send("order-placed", userId, orderEvent);
// hash(userId) % numPartitions → same userId → same partition → ordered
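The routing rule above can be sketched in a few lines. One hedge: Kafka's default partitioner applies murmur2 to the serialized key bytes; `String.hashCode()` below is a simplified stand-in, and `PartitionRouter` is an illustrative name, not a Kafka class. The property that matters survives either way: the same key always lands on the same partition.

```java
// Sketch of key-based partition routing (simplified stand-in for Kafka's
// murmur2-based default partitioner). Same key -> same partition, every time.
public class PartitionRouter {
    public static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit: hashCode() can be negative, partitions cannot
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("user-42", 3);
        int p2 = partitionFor("user-42", 3);
        System.out.println(p1 == p2); // deterministic: same key, same partition
    }
}
```

Note the ceiling this implies: changing `numPartitions` remaps keys, which is why per-key ordering is only guaranteed while the partition count stays fixed.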
Delivery Semantics
What happens when things go wrong — the three guarantees
AT-MOST-ONCE
FIRE AND FORGET
Producer sends, no retry. Consumer auto-commits offset BEFORE processing.

Failure: consumer crashes after commit, before processing → message LOST.
Use: analytics, metrics, logs
(loss is acceptable)
AT-LEAST-ONCE
DEFAULT — MOST COMMON
Producer retries on failure. Consumer commits AFTER processing.

Failure: consumer crashes after processing, before commit → message DUPLICATED.
Use: most systems — combine with
idempotent consumer pattern
EXACTLY-ONCE
HARDEST — ~20% COST
Idempotent producer (seq dedup) + Kafka transactions (atomic write + commit).

No loss, no duplicates. Requires enable.idempotence=true + transactional.id.
Use: financial transactions,
inventory — real harm from duplication
Idempotent Consumer — Making At-Least-Once Safe
Processing the same message twice must be safe · JAVA
public void processPayment(PaymentEvent e) {
    // Fast path: has this message been processed before?
    if (db.exists("processed:" + e.idempotencyKey)) {
        log.info("Duplicate — skipping: {}", e.idempotencyKey);
        return;  // Silently no-op on duplicate
    }

    // Atomic: process + mark as processed in the same DB transaction.
    // A unique constraint on the processed key closes the race where two
    // consumers pass the exists() check concurrently: the second insert
    // fails and rolls the duplicate debit back with it.
    db.transaction(() -> {
        db.debitAccount(e.accountId, e.amount);
        db.markProcessed("processed:" + e.idempotencyKey);
    });
    // Only now commit the Kafka offset. A crash before this line causes a
    // redelivery, which the check above turns into a no-op.
    consumer.commitSync();
}

// Key: idempotencyKey must uniquely identify the business operation
// Options: UUID in message, (userId + orderId + action), event sequence number
Interview insight: Exactly-once in Kafka is real but expensive (~20% throughput cost). In practice, most teams use at-least-once + idempotent consumers. The idempotency key is the secret weapon — if processing the same event twice produces the same DB state, you've achieved the effect of exactly-once without the overhead.
Kafka vs RabbitMQ
Not competing — different tools for different jobs
ASPECT              | KAFKA                                       | RABBITMQ
Delivery model      | Pull (consumers fetch at own pace)          | Push (broker delivers to consumer)
Message retention   | Durable log — days to years after delivery  | Deleted on ACK — ephemeral
Throughput          | Millions of messages/second                 | Hundreds of thousands/sec
Ordering guarantee  | Within a partition (by key)                 | Within a queue (single consumer)
Replay history      | Yes — seek to any offset                    | No — deleted after consume
Multiple consumers  | N independent consumer groups               | Competing consumers (one gets each msg)
Routing logic       | Topic/partition key only                    | Exchanges: direct, topic, fanout, headers
Message TTL         | Topic-level retention only                  | Per-message TTL, priority queues
Decision heuristic: "Do I need replay, multiple independent consumers, or millions of events/sec?" → Kafka. "Do I need complex routing, per-message TTL, or a simple work queue?" → RabbitMQ. In practice: use Kafka for event streaming backbone, RabbitMQ for task queues with routing logic.
When to Use Each — Concrete Scenarios
CHOOSE KAFKA
✓ Order lifecycle events (multiple services react)
✓ User activity stream (audit log needed)
✓ CDC: DB changes → Elasticsearch
✓ Real-time analytics pipeline
✓ Microservice event backbone
✓ New service needs historical data (replay)
CHOOSE RABBITMQ
✓ Background email/SMS sending workers
✓ Task distribution to N worker processes
✓ Routing by message type to different queues
✓ Per-job TTL (expire unprocessed jobs)
✓ Priority queue (high-priority tasks first)
✓ Simple job scheduler without replay needs
Key Kafka Patterns
Fan-out · DLQ · Back-pressure · Ordering · Log compaction
Fan-Out via Consumer Groups
One topic → N consumer groups, all receiving all messages independently. The canonical Kafka pattern for microservices.
Topic "order-placed" →
Group inventory-svc, Group email-svc,
Group analytics-svc, Group fraud-svc
Dead Letter Queue (DLQ)
After N failed retries, route to DLQ topic. Prevents one bad message from blocking the queue. Inspect/replay after fix.
topic → consumer → fail ×3
→ send to topic.dlq
→ alert on-call → fix → replay
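The retry-then-DLQ flow above can be sketched without a broker: try a handler up to N times, then park the message on a "dlq" list instead of blocking the partition. `DlqRouter` is an illustrative name, and the in-memory list stands in for a `topic.dlq` Kafka topic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the dead-letter flow: a poison message exhausts its retries
// and is routed aside, so the consumer can advance to the next offset.
public class DlqRouter {
    public final List<String> dlq = new ArrayList<>(); // stand-in for topic.dlq

    public void deliver(String message, Consumer<String> handler, int maxRetries) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                handler.accept(message);
                return; // processed: move on to the next offset
            } catch (RuntimeException e) {
                // retry; real systems add backoff or a separate retry topic here
            }
        }
        dlq.add(message); // retries exhausted: dead-letter, alert, fix, replay later
    }

    public static void main(String[] args) {
        DlqRouter router = new DlqRouter();
        router.deliver("good", m -> {}, 3);
        router.deliver("poison", m -> { throw new RuntimeException("bad payload"); }, 3);
        System.out.println(router.dlq); // → [poison]
    }
}
```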
Back-Pressure Monitoring
Consumer lag = latest_offset − committed_offset. High lag = consumer falling behind. Scale consumers (up to numPartitions) or optimize logic.
Alert: consumer_lag > 10,000
Action: scale consumer group
Ceiling: max_consumers = num_partitions
Per-Entity Ordering
Partition by user_id or order_id → all events for same entity → same partition → strictly ordered. Different entities process in parallel.
producer.send(topic, userId, event)
→ hash(userId) % numPartitions
→ same partition = in-order
Log Compaction
Retain only the LATEST message per key. New consumers rebuild current state without full history. Like a change-data-capture snapshot.
Before: [u1:v1][u2:v1][u1:v2][u1:v3]
After: [u2:v1][u1:v3] (latest per key)
Use: user profiles, config, inventory
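Compaction's keep-latest-per-key rule is small enough to show directly. A sketch under stated assumptions: a list of `[key, value]` pairs stands in for the log, and `CompactedLog` is an illustrative name; real Kafka compaction runs in the background on closed log segments.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of log compaction: retain only the LATEST value per key,
// ordered by each key's last occurrence, matching the example above.
public class CompactedLog {
    public static Map<String, String> compact(List<String[]> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] record : log) {
            latest.remove(record[0]);         // drop the stale entry so insertion
            latest.put(record[0], record[1]); // order tracks the LAST write per key
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String[]> log = Arrays.asList(
            new String[]{"u1", "v1"}, new String[]{"u2", "v1"},
            new String[]{"u1", "v2"}, new String[]{"u1", "v3"});
        System.out.println(compact(log)); // → {u2=v1, u1=v3}
    }
}
```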
Outbox Pattern
Write to DB outbox table and publish to Kafka atomically (same transaction). Prevents dual-write inconsistency between DB and queue.
BEGIN TRANSACTION
INSERT INTO orders ...
INSERT INTO outbox (event, payload)
COMMIT → CDC picks up → Kafka
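The dual-write hazard the outbox prevents can be made concrete with a toy "transaction": the business write and the outbox write commit together or roll back together. Lists stand in for DB tables, `OutboxDemo` and `placeOrder` are illustrative names, and a real relay (CDC or a poller) would drain `outbox` into Kafka.

```java
import java.util.ArrayList;
import java.util.List;

// Toy outbox: either both the order and its event are recorded, or neither.
// This is the invariant the pattern buys; real systems get it from a single
// DB transaction, not manual snapshots.
public class OutboxDemo {
    public final List<String> orders = new ArrayList<>(); // stand-in for orders table
    public final List<String> outbox = new ArrayList<>(); // stand-in for outbox table

    public void placeOrder(String order, boolean failMidway) {
        List<String> ordersSnapshot = new ArrayList<>(orders);
        List<String> outboxSnapshot = new ArrayList<>(outbox);
        try {
            orders.add(order);
            if (failMidway) throw new RuntimeException("crash before commit");
            outbox.add("order-placed:" + order);
        } catch (RuntimeException e) {
            orders.clear(); orders.addAll(ordersSnapshot); // "rollback": restore both
            outbox.clear(); outbox.addAll(outboxSnapshot); // tables to pre-tx state
        }
    }

    public static void main(String[] args) {
        OutboxDemo db = new OutboxDemo();
        db.placeOrder("A", false); // commits: order + event
        db.placeOrder("B", true);  // crashes: neither is visible
        System.out.println(db.orders.size() + " " + db.outbox.size()); // → 1 1
    }
}
```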
Configuration Cheatsheet
Producer · Consumer · Topic — the settings that matter in interviews
Producer
acks=all
enable.idempotence=true
retries=MAX_INT
linger.ms=5
batch.size=16384
compression.type=snappy
transactional.id=unique-id
Consumer
enable.auto.commit=false
auto.offset.reset=earliest
max.poll.records=500
session.timeout.ms=30000
isolation.level=read_committed
fetch.min.bytes=1
heartbeat.interval.ms=3000
Topic
num.partitions=12
replication.factor=3
min.insync.replicas=2
retention.ms=604800000  (7 days)
cleanup.policy=delete (or compact for changelog topics)
max.message.bytes=1048576
Scale Estimation
How many partitions do I need? · MATH
// Example: order event stream
// 1M orders/day, peak 50× average

Peak events/sec  = 1M orders/day ÷ 86,400 × 50 = ~580 events/sec
Event size       = 1 KB
Peak throughput  = 580 × 1KB = ~0.6 MB/sec

// Single partition max throughput: ~100 MB/sec write
Partitions needed = 0.6 MB/sec ÷ 100 MB/sec = 1 partition (use 12 for growth headroom)

// Storage (7-day retention, 3 replicas):
Daily = 580 events/sec × 86,400 × 1 KB = ~50 GB/day
Total = 50 GB × 7 days × 3 replicas   = ~1.05 TB

// General rules:
//   num_partitions ≥ max_consumers_in_any_group
//   num_partitions = target_throughput_MB_s ÷ throughput_per_partition_MB_s
//   Start with 12–24, easier to add partitions than subtract
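The worked estimate above as re-runnable arithmetic, so you can plug in your own numbers. The 50× peak multiplier and the ~100 MB/sec per-partition write ceiling are the same assumptions the estimate uses; `ScaleEstimate` is an illustrative name.

```java
// Back-of-envelope Kafka sizing: peak rate, partition count, storage.
public class ScaleEstimate {
    static final double SECONDS_PER_DAY = 86_400.0;

    public static double peakEventsPerSec(long eventsPerDay, double peakMultiplier) {
        return eventsPerDay / SECONDS_PER_DAY * peakMultiplier;
    }

    // Sized for throughput only; also need num_partitions >= max consumers per group.
    public static int partitionsForThroughput(double targetMBs, double perPartitionMBs) {
        return (int) Math.ceil(targetMBs / perPartitionMBs);
    }

    // Conservative: assumes peak rate all day, as the estimate above does.
    public static double storageGB(double eventsPerSec, double eventKB,
                                   int retentionDays, int replicas) {
        return eventsPerSec * SECONDS_PER_DAY * eventKB / (1024 * 1024)
                * retentionDays * replicas;
    }

    public static void main(String[] args) {
        double peak = peakEventsPerSec(1_000_000L, 50);            // ~579 events/sec
        double peakMBs = peak * 1.0 / 1024;                        // ~0.57 MB/sec
        System.out.println(partitionsForThroughput(peakMBs, 100)); // → 1 (use 12 for headroom)
        System.out.printf("%.0f GB%n", storageGB(peak, 1.0, 7, 3)); // ~1000 GB ≈ 1 TB
    }
}
```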
01
Delivery Semantics — 5 Systems
~1 hr

For each, choose at-most-once / at-least-once / exactly-once. State the failure scenario and idempotency strategy:

  1. Real-time page view counter for analytics dashboard
  2. Bank transfer between two accounts triggered by Kafka event
  3. Email notification: "Your order has shipped" (user receives one email)
  4. Inventory decrement when an order is placed (oversell = bad)
  5. User activity feed update — showing what friends liked

For each: what is the exact failure scenario if you choose wrong?

02
Partition Key Design — 5 Scenarios
~1 hr

For each: choose the partition key, state the ordering guarantee provided, and identify any potential hotkey risk:

  1. E-commerce order events: created → paid → shipped → delivered (must be in order)
  2. WhatsApp group chat messages (order within a conversation matters)
  3. Real-time stock prices for 10,000 tickers (Apple trades 1000× more than a small cap)
  4. IoT sensor readings from 100K devices
  5. User login/logout events (session coherence required)
03
Kafka vs RabbitMQ — 5 Decisions
~45 min
  1. 8 microservices all need to react to every new user signup — each does something different
  2. Background job system that sends weekly digest emails to 10M users
  3. Fraud detection pipeline that must audit every financial transaction for 5 years
  4. Real-time bidding system where each ad impression must be handled by exactly ONE bidder
  5. CDC pipeline streaming database row changes to Elasticsearch for search indexing
Uber Event Streaming Architecture
~3 hrs · full design

Context: 5M active drivers sending GPS every 4 seconds. Ride lifecycle events (requested, matched, started, completed, rated). Surge pricing recalculated per zone per minute. Real-time analytics + historical audit.

Design complete Kafka architecture. For each topic:

  • Topic name and purpose
  • Partition key choice and ordering guarantee
  • Delivery semantic (with justification)
  • Retention policy (with justification)
  • Which consumer groups consume it and what they do

Calculate: peak events/sec total, storage/day total, minimum partitions needed.

MODULE B4 · MESSAGE QUEUES
3 messaging models: queue, pub-sub, event stream — and when to use each
Kafka: topic → partition → offset → consumer group mental model
Consumer groups: how N groups read the same topic independently
Partition key: ordering guarantee, hotkey risk, hash(key) % N routing
At-most-once: auto-commit before processing — message loss scenario
At-least-once: commit after processing — duplication scenario
Exactly-once: idempotent producer + transactions — ~20% cost
Idempotent consumer pattern implementation in code
Kafka vs RabbitMQ: pull vs push, retention, replay, routing
Fan-out pattern: N consumer groups each getting all messages
Dead letter queue: purpose, after-N-retries flow, manual replay
Consumer lag = latest_offset − committed_offset; how to fix high lag
Key configs: acks=all, enable.idempotence, replication.factor, min.insync.replicas
✏️ Tasks 1–3: semantics, partition keys, Kafka vs Rabbit decisions
✏️ Capstone: Uber streaming architecture — full Kafka design with estimates
// NEXT MODULE
B5 — URL SHORTENER
First end-to-end HLD case study · 300M URLs · 100:1 read:write ratio
Short code generation (base62, MD5) · Redirect latency <10ms · Hot URL caching
Analytics pipeline · Rate limiting · Custom aliases · TTL expiry