SYSTEM DESIGN MASTERY · TRACK B · MODULE B10 · WEEK 20 CONSISTENT HASHING · VIRTUAL NODES · SERVICE DISCOVERY
Distributed Systems Fundamentals · Hash Ring · Registry

HASH RING · VIRTUAL NODES · CONSUL · ZOOKEEPER
CLIENT-SIDE vs SERVER-SIDE · GOSSIP · HEALTH CHECKS
At a glance: ~1/N keys remapped on node change · ring size 2³² · 150 vnodes/node
Contents: The Modulo Hashing Problem · Hash Ring · Virtual Nodes · Consul · ZooKeeper · Gossip Protocol · Health Checks
The Modulo Hashing Problem
Why hash(key) % N breaks when N changes
Modulo Hashing — BROKEN
3 nodes: shard = hash(key) % 3
"user:123" → hash=456 → 456 % 3 = 0 → Node 0
"user:456" → hash=788 → 788 % 3 = 2 → Node 2
"user:789" → hash=111 → 111 % 3 = 0 → Node 0

Add 4th node (N=3 → N=4):
788 % 4 = 0 → WRONG (was Node 2)
111 % 4 = 3 → WRONG (was Node 0)
456 % 4 = 0 → unchanged (lucky)

~75% of all keys remapped instantly.
75% cache miss rate → thundering herd.
Consistent Hashing — SOLVED
Ring: positions [0, 2³²)
Node A → pos 12 · Node B → pos 45 · Node C → pos 78

"user:123" → pos=20 → clockwise → Node B (45)
"user:456" → pos=60 → clockwise → Node C (78)

Add Node D at position 55:
Only keys in range (45, 55] remapped → ~1/(N+1) = 25%.

25% vs 75% — 3× fewer cache misses
The key number: Adding 1 node to an N-node cluster remaps ~1/(N+1) of all keys. For a 12-node cluster adding a 13th: only 7.7% of keys move. With modulo: 92.3% would move.
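These percentages are easy to check empirically. A minimal sketch (random ints stand in for real key hashes; node counts match the 12→13 example above):

Modulo remap check — JAVA
import java.util.Random;

public class ModuloRemapCheck {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int keys = 100_000, moved = 0;
        for (int i = 0; i < keys; i++) {
            int h = rnd.nextInt(Integer.MAX_VALUE);  // stand-in for hash(key)
            if (h % 12 != h % 13) moved++;           // shard changed on resize
        }
        System.out.printf("remapped: %.1f%%%n", 100.0 * moved / keys);
        // Prints ≈ 92.3% — only ~1/13 of keys keep their shard under modulo
    }
}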
The Hash Ring
Nodes and keys share the same circular address space [0, 2³²)
[Ring diagram: circle spanning [0, 2³²) · Node A pos 12 · Node B pos 45 · Node C pos 78 · k:20→B · k:60→C · k:90 wraps→A · candidate Node D at pos 55]
// RING LEGEND
Physical nodes (A, B, C) placed by hash of node name
k:20 → clockwise → node at 45 = Node B
k:60 → clockwise → node at 78 = Node C
k:90 → clockwise → wraps to 12 = Node A
New Node D at pos 55: only keys in (45, 55] move.
Node A owns: (78, 12] (wraps around 0)
Node B owns: (12, 45]
Node C owns: (45, 78]
Virtual Nodes
K positions per physical node → uniform load distribution
1 Vnode/Node
High variance — random positions lead to unequal arc sizes
10 Vnodes/Node
Better — converging toward equal, still some variance
150 Vnodes/Node ★
Near-uniform — a common production choice. Std deviation <5%
Virtual node benefits beyond load balancing — KEY INSIGHT
// BENEFIT 1: Even load (K=150 vnodes → each node ≈ 1/N share)
// BENEFIT 2: Adding node redistributes from ALL existing nodes evenly
// BENEFIT 3: Heterogeneous capacity
PowerfulNode (32 cores): 300 vnodes  → handles 2× data
StandardNode (16 cores): 150 vnodes  → handles 1× data
WeakNode     ( 8 cores):  75 vnodes  → handles 0.5× data
// Cassandra: 256 vnodes default. Redis Cluster: 16,384 hash slots.
Java Implementation
TreeMap provides O(log N) clockwise walk via ceilingEntry()
ConsistentHashRing — JAVA
import java.util.Map;
import java.util.TreeMap;
import com.google.common.hash.Hashing;                  // Guava
import static java.nio.charset.StandardCharsets.UTF_8;

public class ConsistentHashRing {
    // Sorted map: ring position → physical node name
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int VNODES = 150;

    private int hash(String key) {
        return Hashing.murmur3_32().hashString(key, UTF_8).asInt();
    }

    // Place VNODES virtual positions per physical node
    public void addNode(String node) {
        for (int v = 0; v < VNODES; v++) ring.put(hash(node + "#" + v), node);
    }

    public void removeNode(String node) {
        for (int v = 0; v < VNODES; v++) ring.remove(hash(node + "#" + v));
    }

    // Clockwise walk: first vnode at or after the key's position,
    // wrapping to the ring's first entry past the top
    public String getNode(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("No nodes");
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }
}
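For the heterogeneous-capacity case from the virtual-nodes section, a weighted overload of addNode is a natural extension. This is a hypothetical addition, not part of the class above:

Weighted vnodes — JAVA
// Hypothetical: weight 2.0 → 300 vnodes → ~2× the key share
public void addNode(String node, double weight) {
    int vnodes = (int) Math.round(VNODES * weight);
    for (int v = 0; v < vnodes; v++) ring.put(hash(node + "#" + v), node);
}
// removeNode needs the same weight (or a stored per-node vnode count)
// so it can delete every entry it added.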
Real-World Use Cases
Where consistent hashing solves the node change problem
SYSTEM           | HOW IT USES CONSISTENT HASHING                                          | SPECIFICS
Redis Cluster    | 16,384 hash slots across nodes. CRC16(key) % 16,384 → slot → node.     | Hash tags: {user_id} forces related keys to same slot.
Apache Cassandra | 256 vnodes/node. Token range owns ring slice. Reads/writes go to token owner + replicas. | RF=3 → 3 consecutive ring nodes own each key.
Amazon DynamoDB  | Consistent hashing for partition routing. Auto-scaling splits ranges.  | Minimises rebalancing on partition add.
CDN (Akamai)     | Consistent hashing within a PoP across cache servers.                  | Same URL → same cache server → higher hit rate.
Sticky LB        | Same user → same backend server → in-memory session cache works.       | Server failure: only that server's users remapped.
Service Discovery Patterns
How Service A finds Service B in a dynamic microservices environment
Client-Side Discovery
CLIENT QUERIES REGISTRY DIRECTLY
Client queries registry → gets instance list → load balances (round-robin/random) → connects directly.
✓ No extra network hop
✓ Client controls LB strategy
✗ LB logic in every client
✗ Multiple language SDKs needed
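A minimal client-side discovery sketch. The RegistryClient interface here is hypothetical — in practice it would wrap Consul, etcd, or Eureka — and load balancing is plain round-robin:

Client-side discovery — JAVA
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical registry lookup; returns e.g. ["10.0.1.23:8080", ...]
interface RegistryClient {
    List<String> healthyInstances(String serviceName);
}

class ClientSideDiscovery {
    private final RegistryClient registry;
    private final AtomicInteger counter = new AtomicInteger();

    ClientSideDiscovery(RegistryClient registry) { this.registry = registry; }

    // Round-robin across whatever instances are healthy right now
    String pickInstance(String service) {
        List<String> instances = registry.healthyInstances(service);
        if (instances.isEmpty()) throw new IllegalStateException("no healthy instances");
        return instances.get(Math.floorMod(counter.getAndIncrement(), instances.size()));
    }
}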
Server-Side Discovery ★
GATEWAY HANDLES ROUTING
Client calls gateway → gateway queries registry → gateway routes to healthy instance. Clients need no discovery logic.
✓ Simple clients
✓ Central LB policy
✓ Works across languages
✗ Extra hop through gateway
✗ Gateway is critical path
DNS-Based Discovery
SERVICE = DNS A RECORD
Registry publishes IPs as DNS A records. Kubernetes CoreDNS handles this automatically.
✓ Standard protocol
✓ Kubernetes native
✗ DNS TTL caching = stale IPs
✗ No health awareness in DNS
Kubernetes DNS discovery — KUBERNETES
# Format: {service}.{namespace}.svc.cluster.local
curl http://payment-service.default.svc.cluster.local:8080/charge
# CoreDNS → ClusterIP → kube-proxy → healthy pod
# Pod fails → Endpoints controller removes it → no traffic routed
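From a JVM client, DNS-based discovery is plain name resolution. A sketch (the service name matches the curl example above); note the JVM's own DNS cache adds to the stale-IP problem unless networkaddress.cache.ttl is lowered:

DNS lookup — JAVA
import java.net.InetAddress;

public class DnsDiscovery {
    public static void main(String[] args) throws Exception {
        // CoreDNS answers with the ClusterIP (or pod IPs for a headless Service)
        InetAddress[] addrs = InetAddress.getAllByName(
                "payment-service.default.svc.cluster.local");
        for (InetAddress a : addrs) System.out.println(a.getHostAddress());
    }
}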
Consul & ZooKeeper
Production service registries — choose by consistency needs
Consul — register + discover healthy instances — CONSUL HTTP API
PUT /v1/agent/service/register
{ "Name": "payment-service", "ID": "payment-1",
  "Address": "10.0.1.23", "Port": 8080,
  "Check": { "HTTP": "http://10.0.1.23:8080/health", "Interval": "10s" } }

GET /v1/health/service/payment-service?passing=true
// → returns healthy instances only

GET /v1/health/service/payment-service?passing=true&index=50&wait=30s
// → long-poll: blocks until change or timeout → client auto-refreshes
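Calling the health query above from Java with the JDK's built-in HTTP client (Java 11+). A sketch assuming a Consul agent on localhost:8500:

Consul lookup from Java — JAVA
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConsulLookup {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Same query as above: only instances whose checks are passing
        HttpRequest req = HttpRequest.newBuilder(URI.create(
            "http://localhost:8500/v1/health/service/payment-service?passing=true"))
            .GET().build();
        HttpResponse<String> resp =
            client.send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body());  // JSON array of healthy instances
    }
}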
ZooKeeper — ephemeral znodes — ZOOKEEPER
create /services/payment/instance-1  "10.0.1.23:8080"  [EPHEMERAL]
// Ephemeral = auto-deleted on session expire (service crash)
getChildren /services/payment [WATCH]
// → client notified when instance created/deleted
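The same registration in Java with the official ZooKeeper client. A sketch: the connection string is illustrative, and the parent path /services/payment must already exist:

Ephemeral registration — JAVA
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkRegister {
    public static void main(String[] args) throws Exception {
        // 10s session timeout; connection events ignored for brevity
        ZooKeeper zk = new ZooKeeper("zk1:2181", 10_000, event -> {});

        // EPHEMERAL: deleted automatically when this session expires (crash)
        zk.create("/services/payment/instance-1",
                  "10.0.1.23:8080".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);

        // Watch fires once on the next membership change;
        // clients re-list and re-register the watch
        zk.getChildren("/services/payment", true)
          .forEach(System.out::println);
    }
}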
FEATURE       | CONSUL                 | ZOOKEEPER              | ETCD (k8s)
Consistency   | Raft (strong)          | ZAB (strong)           | Raft (strong)
Use case      | Service mesh, multi-DC | Locks, leader election | Kubernetes control plane
Health checks | Built-in (HTTP/TCP)    | Session timeout only   | Leases (TTL keys)
Health Checks
Registry must know which instances are healthy
/health endpoint standard — HTTP
// GET /health → 200 if healthy, 503 if not
{ "status": "healthy", "checks": {
    "database": "connected",
    "redis":    "connected",
    "disk":     "98% free"
  }, "version": "2.3.1" }

// Registry sees 503 → removes from routing pool
// Typical policy: a few consecutive failures (e.g. 2) → removed from routing pool
// Check interval 10s → failure detected within 20–30s
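A minimal /health endpoint matching this contract, using the JDK's built-in HTTP server; the dependency checks are stubbed:

/health endpoint sketch — JAVA
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class HealthEndpoint {
    // Stubs — real checks would ping the DB and Redis connection pools
    static boolean databaseUp() { return true; }
    static boolean redisUp()    { return true; }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/health", exchange -> {
            boolean healthy = databaseUp() && redisUp();
            byte[] body = (healthy
                    ? "{\"status\":\"healthy\"}"
                    : "{\"status\":\"unhealthy\"}").getBytes(StandardCharsets.UTF_8);
            // 200 keeps this instance in the routing pool; 503 pulls it
            exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
    }
}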
Liveness vs Readiness (Kubernetes): Liveness = process alive? (restart if not). Readiness = ready for traffic? (remove from LB if not). Only readiness affects service discovery routing.
Gossip Protocol
Eventual consistency for cluster membership — O(log N) propagation
// GOSSIP PROPAGATION — 16 nodes, K=3 neighbors per round
Round 1: N1★ N5★ N11★ → 3/16 nodes know
Round 2: + N2★ N7★ N9★ N14★ … → 9/16 nodes know
Round 3: ALL 16★ → 16/16 know · O(log₃ 16) ≈ 2.5 rounds
Gossip vs Raft — COMPARISON
// GOSSIP: eventually consistent, O(log N), no leader, scales to 1000s of nodes
// Use for: cluster membership, failure detection

// RAFT: strongly consistent, leader-based, quorum required, ~5-7 nodes practical
// Use for: config store, leader election, distributed locks

// Consul uses BOTH:
//   Gossip (SWIM) for membership + failure detection
//   Raft for KV store + service catalog consistency
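A toy run of the propagation diagram above — push-only gossip with fanout K=3. Counts vary per run since peers are chosen at random:

Gossip propagation sim — JAVA
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class GossipSim {
    public static void main(String[] args) {
        int nodes = 16, fanout = 3;
        Random rnd = new Random();
        Set<Integer> infected = new HashSet<>();
        infected.add(0);  // one node learns the update first

        for (int round = 1; infected.size() < nodes; round++) {
            Set<Integer> next = new HashSet<>(infected);
            for (int node : infected)
                for (int k = 0; k < fanout; k++)
                    next.add(rnd.nextInt(nodes));  // push to K random peers
            infected = next;
            System.out.printf("Round %d: %d/%d nodes know%n",
                              round, infected.size(), nodes);
        }
        // Typically converges in ~log_K(N) rounds — the O(log N) claim above
    }
}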
01
Consistent Hash Ring Implementation
~2 hrs
  1. Implement ConsistentHashRing with addNode, removeNode, getNode using TreeMap
  2. Test with 3 nodes and 1,000 random keys. Verify ~33% per node.
  3. Add a 4th node. Verify ~25% of keys remapped (not 75%).
  4. Remove a node. Verify only that node's keys remapped.
  5. What if two nodes hash to the same ring position? Handle collisions.
02
Virtual Node Distribution Analysis
~1.5 hrs
  1. Simulate 3 nodes with 1, 10, 50, 150, 300 vnodes each.
  2. Hash 10,000 random keys in each configuration.
  3. Calculate std deviation across nodes for each vnode count.
  4. At what vnode count does std dev fall below 5%?
  5. At K=150, how does adding a 4th node compare to modulo hashing?
03
Service Discovery System Design
~1.5 hrs

Design service discovery for 50 microservices, each with 3–10 instances:

  1. Client-side vs server-side discovery — argue both, pick one, justify.
  2. Pod starts at 9:00:00. What happens step-by-step until first request?
  3. Pod dies without SIGTERM at 9:05:00. When do errors stop? What is the gap?
  4. Network partition: 2 of 5 Consul nodes can't reach the other 3. Who serves?
  5. What does /health check? What thresholds trigger unhealthy?
04
Add Consistent Hashing to URL Shortener
~2 hrs

URL shortener from B5 + B9 now has 5 Redis cache nodes:

  1. How many vnodes per Redis node? Justify.
  2. Redis node 3 fails. Miss rate impact vs modulo hashing?
  3. Add Redis node 6. What % of cached URLs need to be reloaded from DB?
  4. What would modulo hashing do when node 3 fails?
  5. Design zero-downtime transition from modulo to consistent hashing.
MODULE B10 · CONSISTENT HASHING — RECAP
Modulo flaw: adding 1 node remaps ~N/(N+1) ≈ all keys
Hash ring: both nodes and keys mapped to [0, 2³²)
Clockwise walk: ceilingEntry() or firstEntry() to wrap
Adding node: only 1/(N+1) keys remapped
Virtual nodes: K positions per physical node → uniform distribution
Vnodes: 150 standard (Cassandra: 256, Redis: 16,384 slots)
Heterogeneous nodes: weighted vnodes (2× hardware → 2× vnodes)
TreeMap: addNode, removeNode, getNode — O(log N)
Use cases: Redis Cluster, Cassandra, CDN, sticky LB
Service discovery registry pattern: register, query, health check
Client-side vs server-side discovery trade-offs
DNS-based: Kubernetes CoreDNS, service.namespace.svc.cluster.local
Consul: HTTP register + health check + passing=true + watch
ZooKeeper: ephemeral znodes auto-deleted on session expire
Health check: 200 healthy / 503 unhealthy; 2-failure threshold
Liveness vs readiness — different purposes, different consequences
Gossip: O(log N) propagation, eventual consistency, no leader
✏️ Tasks 1–3: ring impl, vnode analysis, service discovery
✏️ Task 4 (capstone): consistent hashing added to URL shortener
// NEXT MODULE
B11 — ACID, Distributed Transactions & Saga
ACID properties · 2-Phase Commit · Distributed transactions
Saga pattern (choreography vs orchestration) · Compensating transactions
Outbox pattern · Idempotency · BASE vs ACID trade-offs