The Consensus Problem
Getting N nodes to agree on a value despite failures
What consensus must guarantee · DEFINITION
```
// Consensus: N nodes must agree on ONE value, even when nodes crash.
// Three properties required:
//   Agreement:   All non-faulty nodes decide on the same value.
//   Validity:    The decided value was proposed by some node (not fabricated).
//   Termination: All non-faulty nodes eventually decide (make progress).
//
// Safety vs Liveness:
//   Safety:   "Nothing bad ever happens." (Agreement + Validity)
//             Never return incorrect/conflicting results.
//   Liveness: "Something good eventually happens." (Termination)
//             System eventually makes progress.
//
// FLP Impossibility (1985): In an ASYNC network with ≥1 possible crash,
// no algorithm can guarantee BOTH safety AND liveness.
//
// Raft and Paxos choose SAFETY over liveness:
// They may stall (no progress) during certain failure scenarios,
// but they will NEVER return incorrect results.
// This is the correct choice for databases and coordination services.
```
Why FLP matters in interviews: When asked "can your system guarantee both consistency and availability?", the correct answer is no — CAP theorem. FLP is the theoretical underpinning. Raft/Paxos are CP systems: they choose consistency (safety) over availability (liveness) during partitions.
Paxos — The Original
Four phases to agree on one value — elegant but notoriously hard to implement
PHASE 1
PREPARE
PROPOSER → ACCEPTORS
Proposer chooses unique proposal number N (higher than any previously seen).
Sends PREPARE(N) to majority of Acceptors.
Goal: "I want to propose a value. Will you listen to me (number N) and ignore lower numbers?"
PHASE 1
PROMISE
ACCEPTORS → PROPOSER
Each Acceptor: if N > previously promised number:
→ Responds PROMISE(N, accepted_value, accepted_number)
→ Promises never to accept proposals < N
→ Returns any value it already accepted (crucial for safety!)
If N ≤ promised: ignores or sends NACK.
PHASE 2
ACCEPT
PROPOSER → ACCEPTORS
Proposer receives PROMISE from majority:
→ If any PROMISE included an already-accepted value: MUST use that value (not its own)
→ Else: use its own proposed value
Sends ACCEPT(N, value) to majority.
Safety invariant: using a previously-accepted value preserves agreement already reached.
PHASE 2
ACCEPTED
ACCEPTORS → LEARNERS
Each Acceptor: if N ≥ promised number:
→ Accepts value, sends ACCEPTED(N, value) to Learners
Learner receives ACCEPTED from majority → consensus reached!
Value is chosen. All future proposals will discover and preserve this value.
Why the "must use already-accepted value" rule? Imagine two proposers A and B racing. A gets value "X" accepted by nodes 1 and 2. B comes along and sees node 2's accepted value "X" in its PROMISE response. B must continue with "X" — not its own value. This prevents two different values from being chosen by different majorities. Safety preserved.
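As a minimal sketch (single-value Paxos, illustrative TypeScript types, not any particular library): the proposer scans its majority of PROMISE responses and adopts the value attached to the highest accepted proposal number, proposing its own value only if no acceptor has accepted anything yet.

```typescript
// Sketch: phase-2 value selection in single-value Paxos (illustrative types)
interface PromiseMsg {
  acceptedNumber: number | null;  // highest proposal number this acceptor accepted
  acceptedValue: string | null;   // the value it accepted under that number
}

function chooseValueForAccept(promises: PromiseMsg[], myValue: string): string {
  let best: PromiseMsg | null = null;
  for (const p of promises) {
    if (p.acceptedNumber !== null &&
        (best === null || p.acceptedNumber > best.acceptedNumber!)) {
      best = p;
    }
  }
  // MUST reuse an already-accepted value — this is the safety invariant.
  return best !== null ? best.acceptedValue! : myValue;
}
```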
Raft — Designed for Understandability
Three sub-problems, one strong leader, no concurrent proposals
State transitions: Follower → Candidate on election timeout (no heartbeat); Candidate → Leader on receiving majority votes; Candidate or Leader → Follower on discovering a higher term.
Follower
Passive. Responds to AppendEntries and RequestVote RPCs. If no heartbeat within timeout (150–300ms, random): becomes Candidate.
Candidate
Increments term. Votes for self. Sends RequestVote to all. Gets majority → Leader. Hears from valid Leader → Follower. Timeout → new election.
Leader
Handles ALL client writes. Sends periodic heartbeats (AppendEntries with no entries). Replicates log. Steps down if higher term discovered.
Terms — Raft's logical clock: Every RPC includes the sender's term. If a node receives an RPC with a higher term, it immediately updates its term and reverts to Follower. This means a stale leader that reconnects after a network partition instantly steps down — it will always see a higher term from the new leader. No split-brain from reconnected old leaders.
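A minimal sketch of that rule (illustrative types, assuming a simplified node shape): these few lines run at the top of every RPC handler.

```typescript
// Sketch: the "higher term wins" rule applied to every incoming RPC
type Role = 'FOLLOWER' | 'CANDIDATE' | 'LEADER';
interface RaftNode { currentTerm: number; role: Role; votedFor: string | null }

function observeTerm(node: RaftNode, rpcTerm: number): void {
  if (rpcTerm > node.currentTerm) {
    node.currentTerm = rpcTerm;  // adopt the newer term
    node.role = 'FOLLOWER';      // stale leader/candidate steps down immediately
    node.votedFor = null;        // eligible to vote again in the new term
  }
}
```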
Leader Election
Randomized timeouts prevent simultaneous elections — log completeness prevents stale winners
RequestVote RPC — the two critical checks · ELECTION PROTOCOL
```
// Follower starts election when election timeout expires (150–300ms, RANDOM per node)
// Randomization prevents all nodes timing out simultaneously
function startElection(node) {
  node.currentTerm += 1            // increment term
  node.state = 'CANDIDATE'
  node.votedFor = node.id          // vote for self
  votes = 1
  for each peer in cluster:
    response = peer.requestVote({
      term: node.currentTerm,
      candidateId: node.id,
      lastLogIndex: node.log.lastIndex(),
      lastLogTerm: node.log.lastTerm()
    })
    if (response.voteGranted) votes++
    if (votes > cluster.size / 2) {
      node.state = 'LEADER'
      node.sendHeartbeats()        // immediately prevent new elections
      return
    }
}

// Voter grants vote ONLY IF both conditions hold
// (not shown: reject req.term < currentTerm; a higher req.term first
//  updates currentTerm and clears votedFor)
function handleRequestVote(req) {
  // Condition 1: haven't voted this term yet
  if (votedFor != null && votedFor != req.candidateId)
    return {voteGranted: false}

  // Condition 2: candidate's log is AT LEAST as up-to-date as ours
  // (log completeness check — prevents stale node from winning)
  candidateUpToDate =
    req.lastLogTerm > myLastLogTerm ||
    (req.lastLogTerm == myLastLogTerm && req.lastLogIndex >= myLastLogIndex)
  if (!candidateUpToDate)
    return {voteGranted: false}

  votedFor = req.candidateId
  return {voteGranted: true}
}
```
Log completeness guarantee: A Candidate can only win if its log is as up-to-date as a majority of nodes. Since committed entries are on a majority of nodes, the winner is guaranteed to have all committed entries. This is why committed entries are never lost — any future leader will have them.
Log Replication & Healing
AppendEntries — the workhorse RPC that replicates and heals divergent logs
Log state — 5-node cluster after some entries (quorum = 3):

| Node | idx 1 | idx 2 | idx 3 | idx 4 | Status |
|---|---|---|---|---|---|
| Leader L | t1:a=1 | t1:b=2 | t2:c=3 | t2:d=4 | idx 1–3 committed (majority); idx 4 uncommitted |
| F1 | t1:a=1 | t1:b=2 | t2:c=3 | t2:d=4 | up to date |
| F2 | t1:a=1 | t1:b=2 | t2:c=3 | ? | missing idx 4 → L sends it |
| F3 | t1:a=1 | t1:b=2 | ? | ? | missing idx 3,4 → L sends both |
| F4 | t1:a=1 | t1:b=2 | t1:x=9 | ? | conflict at idx 3! L overwrites with t2:c=3 |
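To see why idx 3 counts as committed, here is a hedged sketch of how a leader can derive its commitIndex from per-follower replication state (the matchIndex name follows the Raft paper; real Raft additionally commits only entries from its current term this way):

```typescript
// Sketch: leader derives commitIndex from replication state (illustrative)
function computeCommitIndex(matchIndex: number[], leaderLastIndex: number): number {
  // One slot per node, leader included; sort descending.
  const indexes = [...matchIndex, leaderLastIndex].sort((a, b) => b - a);
  // The value at position ⌊N/2⌋ is replicated on at least a majority.
  return indexes[Math.floor(indexes.length / 2)];
}

// Cluster above: leader last index 4; matchIndex F1=4, F2=3, F3=2, F4=2
// sorted: [4, 4, 3, 2, 2] → indexes[2] = 3 → idx 1–3 committed, idx 4 not yet
computeCommitIndex([4, 3, 2, 2], 4); // → 3
```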
Log healing — how Leader brings divergent followers up to date · HEALING
```
// Leader tracks nextIndex[i] for each Follower (initially = lastLogIndex + 1)
// On AppendEntries rejection (consistency check failed):
//   decrement nextIndex[i] and retry with older entries.
// Eventually the follower finds a matching point, then syncs forward.

// AppendEntries includes a consistency check:
AppendEntries({
  term: currentTerm,
  leaderId: myId,
  prevLogIndex: nextIndex[i] - 1,       // index of entry BEFORE new ones
  prevLogTerm: log[prevLogIndex].term,  // term of that entry
  entries: log[nextIndex[i]...],        // entries to replicate
  leaderCommit: commitIndex             // highest committed index
})

// Follower F4 (conflict at idx 3):
//   L sends AppendEntries with prevLogIndex=2, prevLogTerm=1
//   F4 checks: log[2].term == 1? YES → accepts
//   F4 replaces log[3] (t1:x=9) with (t2:c=3) — the conflict is overwritten
//   F4 appends (t2:d=4) at idx 4

// Key: committed entries are NEVER overwritten
// (idx 1,2 were committed — majority had them — F4's idx 3 conflict was NOT committed)
```
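And the follower side of that exchange, sketched with simplified in-memory types (1-based log indices as in the diagram; a real implementation persists term and log before replying):

```typescript
// Sketch: follower-side AppendEntries — consistency check, conflict overwrite, append
interface LogEntry { term: number; cmd: string }
interface AppendReq {
  term: number; prevLogIndex: number; prevLogTerm: number;
  entries: LogEntry[]; leaderCommit: number;
}

function onAppendEntries(
  log: LogEntry[], currentTerm: number, commit: { index: number }, req: AppendReq
): boolean {
  if (req.term < currentTerm) return false;  // stale leader: reject
  // Consistency check: do we hold a matching entry at prevLogIndex?
  if (req.prevLogIndex > 0) {
    const prev = log[req.prevLogIndex - 1];  // 1-based → 0-based
    if (prev === undefined || prev.term !== req.prevLogTerm) {
      return false;                          // leader decrements nextIndex, retries
    }
  }
  // Truncate at the first conflicting entry (same index, different term);
  // a conflicting suffix is by definition uncommitted, so this is safe.
  for (let i = 0; i < req.entries.length; i++) {
    const idx = req.prevLogIndex + i;        // 0-based slot of entries[i]
    if (log[idx] !== undefined && log[idx].term !== req.entries[i].term) {
      log.length = idx;
      break;
    }
  }
  // Append any entries we don't already have
  for (let i = 0; i < req.entries.length; i++) {
    if (log[req.prevLogIndex + i] === undefined) log.push(req.entries[i]);
  }
  // Advance commit index, capped at our own last index
  commit.index = Math.max(commit.index, Math.min(req.leaderCommit, log.length));
  return true;
}
```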
Quorum Math & Failure Tolerance
Quorum = ⌊N/2⌋ + 1 — always use odd-sized clusters
| Nodes | Quorum | Tolerates | Notes |
|---|---|---|---|
| 3 | Q = 2 | F = 1 | dev / small |
| 5 | Q = 3 | F = 2 | ★ production |
| 7 | Q = 4 | F = 3 | high durability |
| 4 | Q = 3 | F = 1 | ✗ avoid: same F=1 as N=3, more overhead |
| 6 | Q = 4 | F = 2 | ✗ avoid: same F=2 as N=5, more overhead |
Split-brain prevention via quorum · PARTITION SAFETY
```
// 5-node cluster partitioned into two groups:
//   Partition A: [Node 1, Node 2, Node 3]  ← has quorum (3 of 5) ✓
//   Partition B: [Node 4, Node 5]          ← no quorum (2 of 5)  ✗
//
// Partition A: can elect leader, process writes → active
// Partition B: cannot reach quorum → cannot elect leader → rejects all writes
//
// Two active leaders simultaneously is IMPOSSIBLE:
//   each would need a majority of N nodes, and two separate majorities
//   require 2 × (⌊N/2⌋ + 1) > N nodes.
//   For N=5: 2 × 3 = 6 > 5. Impossible with only 5 nodes.
// ∴ At most ONE partition can ever have quorum. No split-brain. ∎
```
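The same arithmetic as a runnable sketch:

```typescript
// Sketch: the quorum math behind the proof above
const quorum = (n: number): number => Math.floor(n / 2) + 1;

// A partition can make progress only if it holds a quorum of the FULL cluster:
const canElectLeader = (partitionSize: number, clusterSize: number): boolean =>
  partitionSize >= quorum(clusterSize);

canElectLeader(3, 5); // true  — Partition A stays active
canElectLeader(2, 5); // false — Partition B rejects all writes
```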
Real-World Systems
Where Raft and Paxos run in production — and how
etcd
KUBERNETES CONTROL PLANE · RAFT
Every Kubernetes object (Pod, Service, ConfigMap) lives in etcd. kubectl apply → API Server → etcd Raft leader → committed to majority → response.
3 nodes: F=1 · 5 nodes: F=2 ★ production
Linearizable reads: go through leader (fresh)
Serializable reads: any node (may be stale)
Rolling upgrade: 5-node cluster → 4 available while 1 restarts → quorum maintained
CockroachDB
DISTRIBUTED SQL · RAFT PER RANGE
Table data split into 64MB ranges. Each range has 3 replicas forming its own Raft group. A node is simultaneously Leader for some ranges, Follower for others.
Thousands of Raft groups on a 3-node cluster
Leaseholder (Raft leader) handles reads for range
Range splits when >64MB → two Raft groups
Writes: Raft committed → SQL response
Kafka KRaft
REPLACING ZOOKEEPER · KAFKA 3.x+
Kafka's own Raft implementation for broker metadata. Controller quorum (3–5 nodes) uses KRaft. Eliminates separate ZooKeeper cluster.
Pre-3.x: ZooKeeper for controller election
KRaft: metadata stored in __cluster_metadata topic
Failover: ~30s (ZK) → milliseconds (KRaft)
Simpler ops: one system instead of two
ZooKeeper / ZAB
PAXOS VARIANT · COORDINATION SERVICE
ZAB (ZooKeeper Atomic Broadcast) = Paxos variant. Still widely used for distributed locks, leader election, service discovery, config management.
ZAB vs Raft: new ZAB leader re-proposes all uncommitted entries from previous epoch; Raft leader commits new entries and old entries commit implicitly
Ephemeral znodes auto-deleted on session expire
Hadoop, HBase, older Kafka all depend on ZooKeeper
| PROPERTY | PAXOS | RAFT |
|---|---|---|
| Leadership model | Any node can propose (no designated leader) | Strong single leader — only leader proposes |
| Log ordering | Complex — gaps allowed, holes possible | Sequential — no gaps, simple ordering |
| Understandability | Notoriously hard — vague on many details | Explicitly designed to be understandable |
| Livelock risk | Yes — two proposers can keep pre-empting each other | No — randomized timeouts prevent simultaneous elections |
| Reconfiguration | Not specified in original paper | Joint consensus specified and safe |
| Real implementations | Chubby (Google), ZAB (ZooKeeper) | etcd, CockroachDB, TiKV, Consul, KRaft |
Interview Q&A
The consensus questions that appear in FAANG deep dives
"How does Raft prevent split-brain?"
Quorum requirement — a leader can only be elected and can only commit entries with acknowledgements from a majority (⌊N/2⌋+1) of nodes. In a network partition, only one partition can have a majority. Two simultaneous leaders would require two separate majorities, which requires more than N nodes total. For N=5: two majorities of 3 = 6 nodes needed, impossible with 5. At most one partition can ever have quorum — no split-brain.
"What happens if the Raft leader crashes mid-write?"
Two cases: (1) Entry not yet committed (not replicated to a majority) — the new leader may not have it, and the entry may be discarded. From the client's perspective the outcome is unknown (timeout); the client must retry. (2) Entry was committed (majority had it) — the log completeness check in leader election guarantees only a candidate with all committed entries can win. The new leader will have the entry. Committed entries are never lost.
"Why use 5 nodes instead of 3 for etcd in production?"
3 nodes tolerates 1 failure. 5 nodes tolerates 2 simultaneous failures. In a 3-AZ cloud deployment (2+2+1 placement): a 5-node cluster can lose an entire AZ (up to 2 nodes) and still maintain quorum. For rolling upgrades: take 1 node offline → 4/5 available, quorum maintained throughout. The write-latency cost (waiting for 3 ACKs instead of 2) is acceptable for a coordination service. For most production Kubernetes clusters, 5 etcd nodes is the standard.
"What is the difference between Raft and Paxos?"
Both solve consensus. Paxos allows any node to propose values — flexible but complex, especially for log replication where gaps and ordering become issues. It's also underspecified: leader election, reconfiguration, and gap handling are left to the implementer. Raft enforces a strong single leader — all writes go through the leader, the leader's log is always authoritative. Raft also explicitly specifies leader election (randomized timeouts), log replication (AppendEntries with consistency check), and cluster membership changes (joint consensus). Raft was designed to be understandable, which is why most modern systems choose it.
"What is the FLP impossibility result and why does it matter?"
FLP (Fischer, Lynch, Paterson 1985) proves that in a fully asynchronous network where even one node may crash, no consensus algorithm can guarantee both safety (correct results) AND liveness (eventual progress). This means you must choose: Raft and Paxos choose safety — they may stall during certain failure scenarios but will never return incorrect results. This is the correct trade-off for databases. It's also the theoretical foundation of the CAP theorem: CP systems (like etcd) choose consistency over availability during partitions.
"How does CockroachDB use Raft?"
CockroachDB shards table data into 64MB ranges, each with 3 replicas. Each range's replicas form an independent Raft group. A node simultaneously acts as Raft leader for some ranges and follower for others — load is naturally distributed. Writes to a key go to the Raft leader for that key's range, which replicates to the other two replicas and commits when 2 of 3 acknowledge. This gives CockroachDB serializable SQL transactions across a distributed cluster, with each range providing its own linearizable log.
Task 1: Raft Election Simulation
Simulate a 5-node Raft cluster on paper. Nodes: A (leader, term=1), B, C, D, E (all followers).
- Leader A crashes. Election timeouts (random): B=180ms, C=250ms, D=300ms, E=350ms. Walk through exactly what messages B sends, what the others reply, and who wins.
- During B's election, network partition isolates Node E completely. What does E do? What does it see when the partition heals?
- A comes back online. Its term is still 1. The new leader is in term 2. What happens the moment A receives a heartbeat from the new leader?
- Edge case: B and C both time out at exactly the same time (before receiving each other's RequestVote). Walk through the split-vote scenario and the resolution.
Task 2: Log Replication & Healing
5-node cluster. Leader L has log: [t1:a=1][t1:b=2][t2:c=3][t2:d=4]. Quorum = 3.
- Determine which entries are committed. Show your quorum calculation using the follower states shown in the Log Replication tab.
- For F4 (conflicting entry t1:x=9 at index 3): What is the AppendEntries message L sends to heal it? What does F4 do with its conflicting entry?
- Where did F4's t1:x=9 come from? (Hint: think about a previous leader in term 1 that proposed it but crashed before committing.)
- After healing, L commits entry t2:d=4. Walk through the commit protocol — which AppendEntries message triggers F2 and F3 to apply it to their state machines?
Task 3: Quorum & Failure Scenarios
5-node etcd cluster across 3 AZs: AZ1 = nodes 1,2 | AZ2 = nodes 3,4 | AZ3 = node 5.
- AZ2 fails entirely (nodes 3 and 4 go down). Can the remaining 3 nodes elect a leader and process writes? Calculate the quorum.
- Network partition: {1, 3, 5} can reach each other; {2, 4} can reach each other. What happens in each partition? Which can elect a leader?
- You need to restart one node for maintenance. How many can be simultaneously down while maintaining quorum?
- You're expanding from 3 to 5 nodes. During the expansion, there's a brief window where the cluster is in a "joint consensus" state with both old (3) and new (5) configurations active. What are the risks? What would happen if the leader crashed exactly during this window?
Task 4 (Capstone): Design etcd-Backed Leader Election for Microservices
Design a leader election service for a microservice that must have exactly one active instance at a time (e.g., a Saga Orchestrator from B11).
- Use etcd leases (TTL keys): describe the full election protocol — how does a node acquire leadership? What is the TTL and why?
- The leader's network to etcd becomes slow (200ms latency, not broken). The lease TTL is 5s, heartbeat every 1s. What happens? Does the leader lose leadership?
- The leader process crashes without releasing the lease. How long before a new leader is elected? How do you minimize this window?
- Fencing problem: old leader (slow network) and new leader both think they are leader for a brief window. How do you prevent them from both writing to the same resource? (Hint: fencing tokens; see the sketch after this list)
- Compare to ZooKeeper ephemeral znodes: same pattern, but what happens if the ZooKeeper session times out vs if the etcd lease expires?
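For the fencing-token hint above, a minimal sketch of the pattern (names are illustrative; in etcd, the leadership key's ModRevision can serve as the token, since revisions increase monotonically):

```typescript
// Sketch: fencing tokens — the protected resource rejects stale leaders
class FencedResource {
  private highestToken = 0;  // highest token seen so far

  write(fencingToken: number, payload: string): boolean {
    if (fencingToken < this.highestToken) {
      return false;          // old leader presenting a stale token: rejected
    }
    this.highestToken = fencingToken;
    // ... apply payload ...
    return true;
  }
}

// New leader elected with token 34; old, partitioned leader still holds 33:
const db = new FencedResource();
db.write(34, 'from new leader'); // true  — token recorded
db.write(33, 'from old leader'); // false — fenced off
```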
MODULE C1 · CONSENSUS
FLP impossibility: safety vs liveness trade-off, why Raft chooses safety
Paxos: Prepare → Promise → Accept → Accepted — all four phases and why
Paxos safety invariant: proposer must use already-accepted value
Paxos hard to implement: multi-Paxos, gaps, reconfiguration all unspecified
Raft: three sub-problems (election, log replication, safety)
Raft node states: Follower → Candidate → Leader, and all transitions
Terms as logical clock: higher-term RPC → immediately revert to Follower
Leader election: randomized timeouts, RequestVote, log completeness check
Log completeness guarantee: winner always has all committed entries
Log replication: AppendEntries, prevLogIndex/prevLogTerm consistency check
Commitment: entry committed when majority has it — never lost after that
Log healing: nextIndex, decrement on rejection, follower overwrites conflicts
Split-brain proof: two majorities require >N nodes — impossible
Quorum math: N=3→F=1, N=5→F=2; always odd-sized clusters
etcd: Raft for k8s control plane, 3 vs 5 nodes, linearizable vs serializable reads
CockroachDB: Raft per 64MB range, thousands of Raft groups, leaseholder
Kafka KRaft: Raft replacing ZooKeeper, faster failover, simpler ops
✏️ Task 1: Raft election simulation (5-node, crash, partition, stale leader)
✏️ Task 2: Log replication and healing (conflicting entries, commit protocol)
✏️ Task 4 (capstone): etcd-backed leader election with fencing tokens
// NEXT MODULE
C2 — Geo-Distribution & Multi-Region Architecture
Active-active vs active-passive · CRDTs · Conflict resolution
DynamoDB Global Tables · CockroachDB multi-region · Latency-based routing
GDPR & data residency · RPO/RTO design · Cross-region replication lag