The Consensus Problem
Getting N nodes to agree on a value despite failures
What consensus must guarantee · DEFINITION
```
// Consensus: N nodes must agree on ONE value, even when nodes crash.
// Three properties required:
//   Agreement:   All non-faulty nodes decide on the same value.
//   Validity:    The decided value was proposed by some node (not fabricated).
//   Termination: All non-faulty nodes eventually decide (make progress).
//
// Safety vs Liveness:
//   Safety:   "Nothing bad ever happens." (Agreement + Validity)
//             Never return incorrect/conflicting results.
//   Liveness: "Something good eventually happens." (Termination)
//             System eventually makes progress.
//
// FLP Impossibility (1985): In an ASYNC network with ≥1 possible crash,
// no algorithm can guarantee BOTH safety AND liveness.
//
// Raft and Paxos choose SAFETY over liveness:
// They may stall (no progress) during certain failure scenarios,
// but they will NEVER return incorrect results.
// This is the correct choice for databases and coordination services.
```
Why FLP matters in interviews: When asked "can your system guarantee both consistency and availability?", the correct answer is no — CAP theorem. FLP is the theoretical underpinning. Raft/Paxos are CP systems: they choose consistency (safety) over availability (liveness) during partitions.
Paxos — The Original
Four phases to agree on one value — elegant but notoriously hard to implement
PHASE 1
PREPARE
PROPOSER → ACCEPTORS
Proposer chooses unique proposal number N (higher than any previously seen).
Sends PREPARE(N) to majority of Acceptors.
Goal: "I want to propose a value. Will you listen to me (number N) and ignore lower numbers?"
PHASE 1
PROMISE
ACCEPTORS → PROPOSER
Each Acceptor: if N > previously promised number:
→ Responds PROMISE(N, accepted_value, accepted_number)
→ Promises never to accept proposals < N
→ Returns any value it already accepted (crucial for safety!)
If N ≤ promised: ignores or sends NACK.
PHASE 2
ACCEPT
PROPOSER → ACCEPTORS
Proposer receives PROMISE from majority:
→ If any PROMISE included an already-accepted value: MUST use that value (not its own)
→ Else: use its own proposed value
Sends ACCEPT(N, value) to majority.
Safety invariant: using a previously-accepted value preserves agreement already reached.
PHASE 2
ACCEPTED
ACCEPTORS → LEARNERS
Each Acceptor: if N ≥ promised number:
→ Accepts value, sends ACCEPTED(N, value) to Learners
Learner receives ACCEPTED from majority → consensus reached!
Value is chosen. All future proposals will discover and preserve this value.
Why the "must use already-accepted value" rule? Imagine two proposers A and B racing. A gets value "X" accepted by nodes 1 and 2. B comes along and sees node 2's accepted value "X" in its PROMISE response. B must continue with "X" — not its own value. This prevents two different values from being chosen by different majorities. Safety preserved.
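As a minimal sketch (single-value Paxos, illustrative TypeScript types, not any particular library): the proposer scans its majority of PROMISE responses and adopts the value attached to the highest accepted proposal number, proposing its own value only if no acceptor has accepted anything yet.

```typescript
// Sketch: phase-2 value selection in single-value Paxos (illustrative types)
interface PromiseMsg {
  acceptedNumber: number | null;  // highest proposal number this acceptor accepted
  acceptedValue: string | null;   // the value it accepted under that number
}

function chooseValueForAccept(promises: PromiseMsg[], myValue: string): string {
  let best: PromiseMsg | null = null;
  for (const p of promises) {
    if (p.acceptedNumber !== null &&
        (best === null || p.acceptedNumber > best.acceptedNumber!)) {
      best = p;
    }
  }
  // MUST reuse an already-accepted value — this is the safety invariant.
  return best !== null ? best.acceptedValue! : myValue;
}
```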
Raft — Designed for Understandability
Three sub-problems, one strong leader, no concurrent proposals
State transitions: Follower → Candidate on election timeout (no heartbeat); Candidate → Leader on receiving majority votes; Candidate or Leader → Follower on discovering a higher term.
Follower
Passive. Responds to AppendEntries and RequestVote RPCs. If no heartbeat within timeout (150–300ms, random): becomes Candidate.
Candidate
Increments term. Votes for self. Sends RequestVote to all. Gets majority → Leader. Hears from valid Leader → Follower. Timeout → new election.
Leader
Handles ALL client writes. Sends periodic heartbeats (AppendEntries with no entries). Replicates log. Steps down if higher term discovered.
Terms — Raft's logical clock: Every RPC includes the sender's term. If a node receives an RPC with a higher term, it immediately updates its term and reverts to Follower. This means a stale leader that reconnects after a network partition instantly steps down — it will always see a higher term from the new leader. No split-brain from reconnected old leaders.
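A minimal sketch of that rule (illustrative types, assuming a simplified node shape): these few lines run at the top of every RPC handler.

```typescript
// Sketch: the "higher term wins" rule applied to every incoming RPC
type Role = 'FOLLOWER' | 'CANDIDATE' | 'LEADER';
interface RaftNode { currentTerm: number; role: Role; votedFor: string | null }

function observeTerm(node: RaftNode, rpcTerm: number): void {
  if (rpcTerm > node.currentTerm) {
    node.currentTerm = rpcTerm;  // adopt the newer term
    node.role = 'FOLLOWER';      // stale leader/candidate steps down immediately
    node.votedFor = null;        // eligible to vote again in the new term
  }
}
```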
Leader Election
Randomized timeouts prevent simultaneous elections — log completeness prevents stale winners
RequestVote RPC — the two critical checks · ELECTION PROTOCOL
```
// Follower starts election when election timeout expires (150–300ms, RANDOM per node)
// Randomization prevents all nodes timing out simultaneously
function startElection(node) {
  node.currentTerm += 1            // increment term
  node.state = 'CANDIDATE'
  node.votedFor = node.id          // vote for self
  votes = 1
  for each peer in cluster:
    response = peer.requestVote({
      term: node.currentTerm,
      candidateId: node.id,
      lastLogIndex: node.log.lastIndex(),
      lastLogTerm: node.log.lastTerm()
    })
    if (response.voteGranted) votes++
    if (votes > cluster.size / 2) {
      node.state = 'LEADER'
      node.sendHeartbeats()        // immediately prevent new elections
      return
    }
}

// Voter grants vote ONLY IF both conditions hold
// (not shown: reject req.term < currentTerm; a higher req.term first
//  updates currentTerm and clears votedFor)
function handleRequestVote(req) {
  // Condition 1: haven't voted this term yet
  if (votedFor != null && votedFor != req.candidateId)
    return {voteGranted: false}

  // Condition 2: candidate's log is AT LEAST as up-to-date as ours
  // (log completeness check — prevents stale node from winning)
  candidateUpToDate =
    req.lastLogTerm > myLastLogTerm ||
    (req.lastLogTerm == myLastLogTerm && req.lastLogIndex >= myLastLogIndex)
  if (!candidateUpToDate)
    return {voteGranted: false}

  votedFor = req.candidateId
  return {voteGranted: true}
}
```
Log completeness guarantee: A Candidate can only win if its log is as up-to-date as a majority of nodes. Since committed entries are on a majority of nodes, the winner is guaranteed to have all committed entries. This is why committed entries are never lost — any future leader will have them.
Log Replication & Healing
AppendEntries — the workhorse RPC that replicates and heals divergent logs
Log state — 5-node cluster after some entries (quorum = 3):

| Node | idx 1 | idx 2 | idx 3 | idx 4 | Status |
|---|---|---|---|---|---|
| Leader L | t1:a=1 | t1:b=2 | t2:c=3 | t2:d=4 | idx 1–3 committed (majority); idx 4 uncommitted |
| F1 | t1:a=1 | t1:b=2 | t2:c=3 | t2:d=4 | up to date |
| F2 | t1:a=1 | t1:b=2 | t2:c=3 | ? | missing idx 4 → L sends it |
| F3 | t1:a=1 | t1:b=2 | ? | ? | missing idx 3,4 → L sends both |
| F4 | t1:a=1 | t1:b=2 | t1:x=9 | ? | conflict at idx 3! L overwrites with t2:c=3 |
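To see why idx 3 counts as committed, here is a hedged sketch of how a leader can derive its commitIndex from per-follower replication state (the matchIndex name follows the Raft paper; real Raft additionally commits only entries from its current term this way):

```typescript
// Sketch: leader derives commitIndex from replication state (illustrative)
function computeCommitIndex(matchIndex: number[], leaderLastIndex: number): number {
  // One slot per node, leader included; sort descending.
  const indexes = [...matchIndex, leaderLastIndex].sort((a, b) => b - a);
  // The value at position ⌊N/2⌋ is replicated on at least a majority.
  return indexes[Math.floor(indexes.length / 2)];
}

// Cluster above: leader last index 4; matchIndex F1=4, F2=3, F3=2, F4=2
// sorted: [4, 4, 3, 2, 2] → indexes[2] = 3 → idx 1–3 committed, idx 4 not yet
computeCommitIndex([4, 3, 2, 2], 4); // → 3
```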
Log healing — how Leader brings divergent followers up to date · HEALING
```
// Leader tracks nextIndex[i] for each Follower (initially = lastLogIndex + 1)
// On AppendEntries rejection (consistency check failed):
//   decrement nextIndex[i] and retry with older entries.
// Eventually the follower finds a matching point, then syncs forward.

// AppendEntries includes a consistency check:
AppendEntries({
  term: currentTerm,
  leaderId: myId,
  prevLogIndex: nextIndex[i] - 1,       // index of entry BEFORE new ones
  prevLogTerm: log[prevLogIndex].term,  // term of that entry
  entries: log[nextIndex[i]...],        // entries to replicate
  leaderCommit: commitIndex             // highest committed index
})

// Follower F4 (conflict at idx 3):
//   L sends AppendEntries with prevLogIndex=2, prevLogTerm=1
//   F4 checks: log[2].term == 1? YES → accepts
//   F4 replaces log[3] (t1:x=9) with (t2:c=3) — the conflict is overwritten
//   F4 appends (t2:d=4) at idx 4

// Key: committed entries are NEVER overwritten
// (idx 1,2 were committed — majority had them — F4's idx 3 conflict was NOT committed)
```
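And the follower side of that exchange, sketched with simplified in-memory types (1-based log indices as in the diagram; a real implementation persists term and log before replying):

```typescript
// Sketch: follower-side AppendEntries — consistency check, conflict overwrite, append
interface LogEntry { term: number; cmd: string }
interface AppendReq {
  term: number; prevLogIndex: number; prevLogTerm: number;
  entries: LogEntry[]; leaderCommit: number;
}

function onAppendEntries(
  log: LogEntry[], currentTerm: number, commit: { index: number }, req: AppendReq
): boolean {
  if (req.term < currentTerm) return false;  // stale leader: reject
  // Consistency check: do we hold a matching entry at prevLogIndex?
  if (req.prevLogIndex > 0) {
    const prev = log[req.prevLogIndex - 1];  // 1-based → 0-based
    if (prev === undefined || prev.term !== req.prevLogTerm) {
      return false;                          // leader decrements nextIndex, retries
    }
  }
  // Truncate at the first conflicting entry (same index, different term);
  // a conflicting suffix is by definition uncommitted, so this is safe.
  for (let i = 0; i < req.entries.length; i++) {
    const idx = req.prevLogIndex + i;        // 0-based slot of entries[i]
    if (log[idx] !== undefined && log[idx].term !== req.entries[i].term) {
      log.length = idx;
      break;
    }
  }
  // Append any entries we don't already have
  for (let i = 0; i < req.entries.length; i++) {
    if (log[req.prevLogIndex + i] === undefined) log.push(req.entries[i]);
  }
  // Advance commit index, capped at our own last index
  commit.index = Math.max(commit.index, Math.min(req.leaderCommit, log.length));
  return true;
}
```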
Quorum Math & Failure Tolerance
Quorum = ⌊N/2⌋ + 1 — always use odd-sized clusters
| Nodes | Quorum | Tolerates | Notes |
|---|---|---|---|
| 3 | Q = 2 | F = 1 | dev / small |
| 5 | Q = 3 | F = 2 | ★ production |
| 7 | Q = 4 | F = 3 | high durability |
| 4 | Q = 3 | F = 1 | ✗ avoid: same F=1 as N=3, more overhead |
| 6 | Q = 4 | F = 2 | ✗ avoid: same F=2 as N=5, more overhead |
Split-brain prevention via quorum · PARTITION SAFETY
```
// 5-node cluster partitioned into two groups:
//   Partition A: [Node 1, Node 2, Node 3]  ← has quorum (3 of 5) ✓
//   Partition B: [Node 4, Node 5]          ← no quorum (2 of 5)  ✗
//
// Partition A: can elect leader, process writes → active
// Partition B: cannot reach quorum → cannot elect leader → rejects all writes
//
// Two active leaders simultaneously is IMPOSSIBLE:
//   each would need a majority of N nodes, and two separate majorities
//   require 2 × (⌊N/2⌋ + 1) > N nodes.
//   For N=5: 2 × 3 = 6 > 5. Impossible with only 5 nodes.
// ∴ At most ONE partition can ever have quorum. No split-brain. ∎
```
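The same arithmetic as a runnable sketch:

```typescript
// Sketch: the quorum math behind the proof above
const quorum = (n: number): number => Math.floor(n / 2) + 1;

// A partition can make progress only if it holds a quorum of the FULL cluster:
const canElectLeader = (partitionSize: number, clusterSize: number): boolean =>
  partitionSize >= quorum(clusterSize);

canElectLeader(3, 5); // true  — Partition A stays active
canElectLeader(2, 5); // false — Partition B rejects all writes
```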
Real-World Systems
Where Raft and Paxos run in production — and how
etcd
KUBERNETES CONTROL PLANE · RAFT
Every Kubernetes object (Pod, Service, ConfigMap) lives in etcd. kubectl apply → API Server → etcd Raft leader → committed to majority → response.
3 nodes: F=1 · 5 nodes: F=2 ★ production
Linearizable reads: go through leader (fresh)
Serializable reads: any node (may be stale)
Rolling upgrade: 5-node cluster → 4 available while 1 restarts → quorum maintained
CockroachDB
DISTRIBUTED SQL · RAFT PER RANGE
Table data split into 64MB ranges. Each range has 3 replicas forming its own Raft group. A node is simultaneously Leader for some ranges, Follower for others.
Thousands of Raft groups on a 3-node cluster
Leaseholder (Raft leader) handles reads for range
Range splits when >64MB → two Raft groups
Writes: Raft committed → SQL response
Kafka KRaft
REPLACING ZOOKEEPER · KAFKA 3.x+
Kafka's own Raft implementation for broker metadata. Controller quorum (3–5 nodes) uses KRaft. Eliminates separate ZooKeeper cluster.
Pre-3.x: ZooKeeper for controller election
KRaft: metadata stored in __cluster_metadata topic
Failover: ~30s (ZK) → milliseconds (KRaft)
Simpler ops: one system instead of two
ZooKeeper / ZAB
PAXOS VARIANT · COORDINATION SERVICE
ZAB (ZooKeeper Atomic Broadcast) = Paxos variant. Still widely used for distributed locks, leader election, service discovery, config management.
ZAB vs Raft: new ZAB leader re-proposes all uncommitted entries from previous epoch; Raft leader commits new entries and old entries commit implicitly
Ephemeral znodes auto-deleted on session expire
Hadoop, HBase, older Kafka all depend on ZooKeeper
| PROPERTY | PAXOS | RAFT |
|---|---|---|
| Leadership model | Any node can propose (no designated leader) | Strong single leader — only leader proposes |
| Log ordering | Complex — gaps allowed, holes possible | Sequential — no gaps, simple ordering |
| Understandability | Notoriously hard — vague on many details | Explicitly designed to be understandable |
| Livelock risk | Yes — two proposers can keep pre-empting each other | No — randomized timeouts prevent simultaneous elections |
| Reconfiguration | Not specified in original paper | Joint consensus specified and safe |
| Real implementations | Chubby (Google), ZAB (ZooKeeper) | etcd, CockroachDB, TiKV, Consul, KRaft |
Interview Q&A
The consensus questions that appear in FAANG deep dives
"How does Raft prevent split-brain?"
Quorum requirement — a leader can only be elected and can only commit entries with acknowledgements from a majority (⌊N/2⌋+1) of nodes. In a network partition, only one partition can have a majority. Two simultaneous leaders would require two separate majorities, which requires more than N nodes total. For N=5: two majorities of 3 = 6 nodes needed, impossible with 5. At most one partition can ever have quorum — no split-brain.
"What happens if the Raft leader crashes mid-write?"
Two cases: (1) Entry not yet committed (not replicated to a majority) — the new leader may not have it, and the entry may be discarded. From the client's perspective the outcome is unknown (timeout); the client must retry. (2) Entry was committed (majority had it) — the log completeness check in leader election guarantees only a candidate with all committed entries can win. The new leader will have the entry. Committed entries are never lost.
"Why use 5 nodes instead of 3 for etcd in production?"
3 nodes tolerates 1 failure. 5 nodes tolerates 2 simultaneous failures. In a 3-AZ cloud deployment (2+2+1 placement): a 5-node cluster can lose an entire AZ (up to 2 nodes) and still maintain quorum. For rolling upgrades: take 1 node offline → 4/5 available, quorum maintained throughout. The write-latency cost (waiting for 3 ACKs instead of 2) is acceptable for a coordination service. For most production Kubernetes clusters, 5 etcd nodes is the standard.
"What is the difference between Raft and Paxos?"
Both solve consensus. Paxos allows any node to propose values — flexible but complex, especially for log replication where gaps and ordering become issues. It's also underspecified: leader election, reconfiguration, and gap handling are left to the implementer. Raft enforces a strong single leader — all writes go through the leader, the leader's log is always authoritative. Raft also explicitly specifies leader election (randomized timeouts), log replication (AppendEntries with consistency check), and cluster membership changes (joint consensus). Raft was designed to be understandable, which is why most modern systems choose it.
"What is the FLP impossibility result and why does it matter?"
FLP (Fischer, Lynch, Paterson 1985) proves that in a fully asynchronous network where even one node may crash, no consensus algorithm can guarantee both safety (correct results) AND liveness (eventual progress). This means you must choose: Raft and Paxos choose safety — they may stall during certain failure scenarios but will never return incorrect results. This is the correct trade-off for databases. It's also the theoretical foundation of the CAP theorem: CP systems (like etcd) choose consistency over availability during partitions.
"How does CockroachDB use Raft?"
CockroachDB shards table data into 64MB ranges, each with 3 replicas. Each range's replicas form an independent Raft group. A node simultaneously acts as Raft leader for some ranges and follower for others — load is naturally distributed. Writes to a key go to the Raft leader for that key's range, which replicates to the other two replicas and commits when 2 of 3 acknowledge. This gives CockroachDB serializable SQL transactions across a distributed cluster, with each range providing its own linearizable log.
Task 1: Raft Election Simulation
Simulate a 5-node Raft cluster on paper. Nodes: A (leader, term=1), B, C, D, E (all followers).
- Leader A crashes. Election timeouts (random): B=180ms, C=250ms, D=300ms, E=350ms. Walk through exactly what messages B sends, what the others reply, and who wins.
- During B's election, network partition isolates Node E completely. What does E do? What does it see when the partition heals?
- A comes back online. Its term is still 1. The new leader is in term 2. What happens the moment A receives a heartbeat from the new leader?
- Edge case: B and C both time out at exactly the same time (before receiving each other's RequestVote). Walk through the split-vote scenario and the resolution.
Task 2: Log Replication & Healing
5-node cluster. Leader L has log: [t1:a=1][t1:b=2][t2:c=3][t2:d=4]. Quorum = 3.
- Determine which entries are committed. Show your quorum calculation using the follower states shown in the Log Replication tab.
- For F4 (conflicting entry t1:x=9 at index 3): What is the AppendEntries message L sends to heal it? What does F4 do with its conflicting entry?
- Where did F4's t1:x=9 come from? (Hint: think about a previous leader in term 1 that proposed it but crashed before committing.)
- After healing, L commits entry t2:d=4. Walk through the commit protocol — which AppendEntries message triggers F2 and F3 to apply it to their state machines?
Task 3: Quorum & Failure Scenarios
5-node etcd cluster across 3 AZs: AZ1 = nodes 1,2 | AZ2 = nodes 3,4 | AZ3 = node 5.
- AZ2 fails entirely (nodes 3 and 4 go down). Can the remaining 3 nodes elect a leader and process writes? Calculate the quorum.
- Network partition: {1, 3, 5} can reach each other; {2, 4} can reach each other. What happens in each partition? Which can elect a leader?
- You need to restart one node for maintenance. How many can be simultaneously down while maintaining quorum?
- You're expanding from 3 to 5 nodes. During the expansion, there's a brief window where the cluster is in a "joint consensus" state with both old (3) and new (5) configurations active. What are the risks? What would happen if the leader crashed exactly during this window?
Task 4 (Capstone): Design etcd-Backed Leader Election for Microservices
Design a leader election service for a microservice that must have exactly one active instance at a time (e.g., a Saga Orchestrator from B11).
- Use etcd leases (TTL keys): describe the full election protocol — how does a node acquire leadership? What is the TTL and why?
- The leader's network to etcd becomes slow (200ms latency, not broken). The lease TTL is 5s, heartbeat every 1s. What happens? Does the leader lose leadership?
- The leader process crashes without releasing the lease. How long before a new leader is elected? How do you minimize this window?
- Fencing problem: old leader (slow network) and new leader both think they are leader for a brief window. How do you prevent them from both writing to the same resource? (Hint: fencing tokens; see the sketch after this list)
- Compare to ZooKeeper ephemeral znodes: same pattern, but what happens if the ZooKeeper session times out vs if the etcd lease expires?
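For the fencing-token hint above, a minimal sketch of the pattern (names are illustrative; in etcd, the leadership key's ModRevision can serve as the token, since revisions increase monotonically):

```typescript
// Sketch: fencing tokens — the protected resource rejects stale leaders
class FencedResource {
  private highestToken = 0;  // highest token seen so far

  write(fencingToken: number, payload: string): boolean {
    if (fencingToken < this.highestToken) {
      return false;          // old leader presenting a stale token: rejected
    }
    this.highestToken = fencingToken;
    // ... apply payload ...
    return true;
  }
}

// New leader elected with token 34; old, partitioned leader still holds 33:
const db = new FencedResource();
db.write(34, 'from new leader'); // true  — token recorded
db.write(33, 'from old leader'); // false — fenced off
```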
MODULE C1 · CONSENSUS
FLP impossibility: safety vs liveness trade-off, why Raft chooses safety
Paxos: Prepare → Promise → Accept → Accepted — all four phases and why
Paxos safety invariant: proposer must use already-accepted value
Paxos hard to implement: multi-Paxos, gaps, reconfiguration all unspecified
Raft: three sub-problems (election, log replication, safety)
Raft node states: Follower → Candidate → Leader, and all transitions
Terms as logical clock: higher-term RPC → immediately revert to Follower
Leader election: randomized timeouts, RequestVote, log completeness check
Log completeness guarantee: winner always has all committed entries
Log replication: AppendEntries, prevLogIndex/prevLogTerm consistency check
Commitment: entry committed when majority has it — never lost after that
Log healing: nextIndex, decrement on rejection, follower overwrites conflicts
Split-brain proof: two majorities require >N nodes — impossible
Quorum math: N=3→F=1, N=5→F=2; always odd-sized clusters
etcd: Raft for k8s control plane, 3 vs 5 nodes, linearizable vs serializable reads
CockroachDB: Raft per 64MB range, thousands of Raft groups, leaseholder
Kafka KRaft: Raft replacing ZooKeeper, faster failover, simpler ops
✏️ Task 1: Raft election simulation (5-node, crash, partition, stale leader)
✏️ Task 2: Log replication and healing (conflicting entries, commit protocol)
✏️ Task 4 (capstone): etcd-backed leader election with fencing tokens
// NEXT MODULE
C2 — Geo-Distribution & Multi-Region Architecture
Active-active vs active-passive · CRDTs · Conflict resolution
DynamoDB Global Tables · CockroachDB multi-region · Latency-based routing
GDPR & data residency · RPO/RTO design · Cross-region replication lag