SYSTEM DESIGN MASTERY · TRACK C · MODULE C2 · WEEK 26 · GEO-DISTRIBUTION · ACTIVE-ACTIVE · CRDTS · GDPR · RPO/RTO
Advanced Distributed Systems · Multi-Region · Data Residency

Geo‑Distribution & Multi‑Region Architecture

ACTIVE-PASSIVE · ACTIVE-ACTIVE · CRDTs
DYNAMODB GLOBAL · COCKROACHDB · GDPR · RPO/RTO
~85ms · US ↔ EU RTT
6 · CRDT TYPES
RPO=0 · SYNC REPLICATION
MODULE C2
Speed of Light
Active-Passive
Active-Active
CRDTs
GDPR
DynamoDB Global
CockroachDB
RPO / RTO
Why Go Multi-Region?
Three distinct motivations — each requires a different solution
Availability
SURVIVE REGION OUTAGE
AWS us-east-1 has gone down multiple times (2017, 2021, 2023). A single-region system goes down with it.
Solution:
Active-passive failover
Active-active with replication
Latency
SERVE NEARBY USERS
Tokyo user hitting us-east-1: ~140ms. Tokyo user hitting ap-northeast-1: ~5ms. For writes: 28× difference.
Solution:
Read replicas in each region
Regional active-active writes
Compliance
DATA RESIDENCY LAWS
GDPR (EU), India's DPDP Act, and China's PIPL restrict where personal data may be stored and how it may cross borders. Violations carry large fines.
Solution:
Regional data partitioning
Geo-routing + KMS per region
Interview clarifying question: "Before I design the multi-region architecture — which of these are we solving? Availability requires hot standby and automatic failover. Latency requires regional replicas or active-active. Compliance requires strict data partitioning and geo-routing. They have very different solutions and cost profiles."
Speed of Light
The unavoidable physical constraint — these numbers drive every multi-region decision
// REAL-WORLD ROUND-TRIP TIMES (approximate)
Within AZ                ~0.5ms      baseline
Cross-AZ                 ~1–2ms      fast
US-East ↔ US-West        ~70ms       noticeable
US-East ↔ EU-West        ~85ms       slow
US-East ↔ APAC           ~140ms      very slow
EU-West ↔ APAC           ~130ms      very slow
EU-West ↔ EU-Central     ~20ms       ok
Within APAC              ~50–80ms    varies
Latency math: why synchronous global consensus is impractical · CALCULATION
// Raft cluster: US-East (leader), EU-West, AP-Southeast
// Quorum = 2 of 3. Client writes from US-East.

Write latency = time until quorum ACK
  US-East → EU-West round trip:  ~85ms
  US-East → AP-Southeast RT:     ~140ms

Quorum write latency = min(EU-West RTT, AP-Southeast RTT) = ~85ms
// must wait for 1 of 2 remote nodes to ACK

// This means every write to your database takes 85ms minimum.
// A 10ms SLA is IMPOSSIBLE with synchronous global Raft.

// Solution: non-voting replicas (CockroachDB style)
// US-East: 2 voting replicas (quorum = 2, all local → ~1ms write latency)
// EU-West: 1 non-voting replica (async replication, ~1s lag)
// AP-Southeast: 1 non-voting replica (async replication, ~1s lag)
// Write latency: ~1ms (local quorum) ✓
// EU/APAC reads: served locally (slightly stale) ✓
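The same arithmetic can be packaged as a small Python sketch (and reused for Task 1). The region names and RTT figures below are the approximate numbers from the table above, not measurements:

# Sketch: Raft commit latency for a given placement of voting replicas.
RTT_MS = {
    ("us-east", "us-east"): 1,          # replica in the same region
    ("us-east", "eu-west"): 85,
    ("us-east", "ap-southeast"): 140,
}

def commit_latency_ms(leader_region, other_voter_regions):
    # Raft commit = leader + floor(n/2) ACKs; the leader's own vote is free,
    # so it waits for the fastest (quorum - 1) ACKs among the other voters.
    n = 1 + len(other_voter_regions)
    acks_needed = (n // 2 + 1) - 1
    rtts = sorted(RTT_MS[(leader_region, r)] for r in other_voter_regions)
    return rtts[acks_needed - 1] if acks_needed else 0

# All three regions voting: every commit waits for the nearest remote region
print(commit_latency_ms("us-east", ["eu-west", "ap-southeast"]))   # 85
# Voting replicas kept local, remote regions non-voting: commits stay local
print(commit_latency_ms("us-east", ["us-east", "us-east"]))        # 1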
Active-Passive vs Active-Active
The fundamental choice — consistency vs availability under partition
Active-Passive
ONE PRIMARY · N STANDBYS
Primary serves all reads and writes. Standby(s) replicate from primary but serve nothing normally. On primary failure: standby is promoted.
Sync replication: primary waits for standby ACK before committing. RPO=0, but adds cross-region latency to every write.
Async replication: primary commits immediately, replication in background. RPO=seconds, no write latency penalty. Standard choice.
✓ Simple to reason about
✓ No conflict resolution needed
✓ Strong consistency always
✗ All writes go to one region
✗ Remote users pay write latency
✗ Failover gap: 30s–minutes
✗ Risk: split-brain on failover
Active-Active
ALL REGIONS READ + WRITE
Multiple regions all serve reads AND writes. No primary. All regions are peers. Writes in different regions must converge.
Conflict resolution required: Last-Write-Wins (LWW), vector clocks, CRDTs, or Operational Transform. Each has different trade-offs.
Eventually consistent: after a write in EU, the US region may be ~1s behind. Briefly inconsistent across regions.
✓ Low write latency everywhere
✓ Survives full region loss
✓ No bottleneck primary region
✗ Conflict resolution complexity
✗ Eventually consistent
✗ Some conflicts = silent data loss (LWW)
✗ Harder to debug
CONFLICT STRATEGY | HOW IT WORKS | DATA LOSS? | BEST FOR
Last-Write-Wins | Highest timestamp wins on conflict | Yes — concurrent write silently lost | Sessions, caches, preferences
Vector Clocks | Track causality; surface conflicts to app | No — app resolves explicitly | Shopping carts (Dynamo original)
CRDTs | Mathematically conflict-free merge | Never — by design | Counters, sets, collaborative editing
Operational Transform | Transform ops against concurrent ops | Never — but complex | Real-time collaborative text (Google Docs)
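Before moving to CRDTs, a tiny Python sketch of how LWW silently drops a concurrent write; the (value, timestamp) register and the timestamps are illustrative, not any particular database's storage format:

# Sketch: why Last-Write-Wins loses a concurrent write.
def lww_merge(a, b):                      # each write is (value, timestamp)
    return a if a[1] >= b[1] else b

# Both regions read balance = 0, then apply deposits concurrently.
us_write = (0 + 100, 1700000000.001)      # US client deposits $100
eu_write = (0 + 50,  1700000000.002)      # EU client deposits $50, 1 ms later

print(lww_merge(us_write, eu_write))      # (50, ...) → the $100 deposit is gone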
CRDTs — Conflict-Free Replicated Data Types
Mathematical guarantee: any two replicas merged in any order → same result
The three properties: merge must be Commutative (order doesn't matter), Associative (grouping doesn't matter), Idempotent (merging twice = merging once). Any data structure satisfying these three is a CRDT — concurrent updates across replicas always converge.
G-Counter
GROW-ONLY COUNTER
Vector of counts, one slot per node. Node i only increments slot i. Merge = max of each slot. Value = sum all slots.
State: [3, 5, 2]
Merge: max per slot
Value: sum = 10
view counts, like counts, event totals
PN-Counter
INCREMENT + DECREMENT
Two G-Counters: P (increments) and N (decrements). Value = sum(P) - sum(N). Supports decrements while remaining a CRDT.
P: [3,5,2] N: [1,2,0]
Value = 10 - 3 = 7
shopping cart quantities, inventory approximation
G-Set
GROW-ONLY SET
Elements can only be added, never removed. Merge = union of both sets. Trivially conflict-free.
A: {x,y} B: {y,z}
Merge: {x,y,z}
tag sets, immutable membership lists
2P-Set
ADD + PERMANENT REMOVE
Two G-Sets: Add-set A and Remove-set R. Element present if: in A AND NOT in R. Once removed, cannot re-add.
Present = A \ R
Remove is permanent
when re-add not needed: banned users, archived items
OR-Set
OBSERVED-REMOVE SET
Each add tagged with unique ID. Remove only removes elements with that specific tag. Allows correct add-remove-add cycles.
add(x, id1) → remove(x, id1)
add(x, id2) → x still present
collaborative editing, presence systems
Sequence CRDT
RGA / LSEQ (TEXT EDITING)
Positions in sequence assigned unique IDs. Insert/delete at ID works regardless of concurrent operations on same text. Used in collaborative editors.
Insert at pos-ID
Concurrent inserts merge correctly
Apple Notes, Figma, Notion, Google Docs (OT variant)
G-Counter CRDT — concrete implementation · PYTHON
class GCounter:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.counts = [0] * num_nodes   # slot per node

    def increment(self):
        self.counts[self.node_id] += 1  # only increment OWN slot

    def value(self):
        return sum(self.counts)

    def merge(self, other):             # merge another replica
        self.counts = [max(a, b)
            for a, b in zip(self.counts, other.counts)]

# Simulation: 3 nodes
A = GCounter(0, 3); A.increment(); A.increment(); A.increment()  # [3,0,0]
B = GCounter(1, 3); B.increment(); B.increment()                 # [0,2,0]
C = GCounter(2, 3); C.increment()                                # [0,0,1]

A.merge(B); A.merge(C)  # [3,2,1] → value = 6
B.merge(A); B.merge(C)  # [3,2,1] → value = 6
# All replicas converge to 6 regardless of merge order ✓
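The same merge discipline extends to sets. A compact OR-Set sketch along the lines of the card above; the tagging scheme (one random UUID per add) is an illustrative choice:

# Sketch: OR-Set, add-tags plus observed-remove tombstones.
import uuid

class ORSet:
    def __init__(self):
        self.adds = {}       # element -> set of unique add-tags
        self.removed = {}    # element -> set of tombstoned tags

    def add(self, elem):
        self.adds.setdefault(elem, set()).add(uuid.uuid4().hex)

    def remove(self, elem):
        # tombstone only the tags this replica has observed so far
        self.removed.setdefault(elem, set()).update(self.adds.get(elem, set()))

    def contains(self, elem):
        return bool(self.adds.get(elem, set()) - self.removed.get(elem, set()))

    def merge(self, other):              # union both maps, slot by slot
        for e, tags in other.adds.items():
            self.adds.setdefault(e, set()).update(tags)
        for e, tags in other.removed.items():
            self.removed.setdefault(e, set()).update(tags)

# Concurrent remove (on A) and re-add (on B) converge with x still present
A, B = ORSet(), ORSet()
A.add("x"); B.merge(A)                   # both replicas observe the first add
A.remove("x")                            # A removes the tag it observed
B.add("x")                               # B concurrently re-adds with a fresh tag
A.merge(B); B.merge(A)
print(A.contains("x"), B.contains("x"))  # True True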
DynamoDB Global Tables
AWS managed active-active — LWW conflict resolution, ~1s replication lag
DynamoDB Global Tables — what it does and doesn't handle · AWS
// Architecture: table replicated across N AWS regions
// Each region: accepts reads AND writes independently
// Replication: asynchronous, bidirectional, ~1s lag
// Conflict resolution: LAST-WRITE-WINS (timestamp-based)

// GOOD USES (LWW acceptable):
✓ User sessions: {user_id → session_token, last_seen}
✓ User preferences: {user_id → theme, language, notifications}
✓ Shopping cart: {user_id → cart_items} (last write wins per user)
✓ Configuration flags: {feature_flag → enabled/disabled}

// BAD USES (LWW causes data loss):
✗ Financial balances: concurrent increments → one lost
   User in US adds $100, user in EU adds $50 simultaneously
   One write wins → balance shows +$100 OR +$50, not +$150
✗ Inventory: concurrent decrements → overselling
✗ Leaderboard rankings: concurrent score updates

// For counters: use DynamoDB Streams + Lambda to consolidate
// Or: use a CRDT service (PN-Counter semantics)
// Or: route all writes for a given key to its "home" region

// Cost: every replica region stores a full copy and is billed for replicated
// writes, so N regions ≈ N× storage + write-throughput cost
// Replication lag: typically ~1s, can spike to ~5s under high load
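One way to keep counters safe on an LWW store, per the last comment in the block above, is to give every key a single "home" region so no two regions ever write the same item concurrently. A minimal Python routing sketch; the region list, the hash-based assignment, and send_to are assumptions for illustration, not a specific AWS API:

# Sketch: "home region" write routing to avoid concurrent cross-region writes.
import hashlib

REGIONS = ["us-east-1", "eu-west-1", "ap-northeast-1"]

def home_region(key: str) -> str:
    digest = hashlib.sha256(key.encode()).hexdigest()
    return REGIONS[int(digest, 16) % len(REGIONS)]

def write(key: str, update: dict) -> str:
    region = home_region(key)
    # send_to(region, key, update)   # placeholder for the per-region client call
    return region

# Every increment to this balance lands in one region, so LWW replication
# never has to choose between two concurrent versions of the item.
print(write("account:123", {"op": "add_balance", "amount": 100}))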
CockroachDB Multi-Region
Designed for multi-region from the ground up — regional row placement, non-voting replicas
Table locality modes — the key CockroachDB multi-region concept · COCKROACHDB SQL
-- Set survival goal: lose an entire region without data loss
ALTER DATABASE mydb SURVIVE REGION FAILURE;

-- REGIONAL BY ROW: each row pinned to its home region
-- EU user's rows stored in EU region → EU reads/writes at 5ms latency
ALTER TABLE users SET LOCALITY REGIONAL BY ROW;  -- uses the hidden crdb_region column

-- User in EU: crdb_region='eu-west-1' → row stored in EU → fast local access
INSERT INTO users (id, name, crdb_region) VALUES (1, 'Alice', 'eu-west-1');

-- REGIONAL TABLE: entire table anchored to one region
-- Good for: tables only accessed from one region
ALTER TABLE eu_compliance_log SET LOCALITY REGIONAL IN 'eu-west-1';

-- GLOBAL: replicated everywhere, optimized for reads everywhere
-- Writes need global consensus (slower), reads are local (fast)
-- Good for: product catalog, reference data, configuration
ALTER TABLE product_catalog SET LOCALITY GLOBAL;

-- Follower reads: slightly stale (~4.8s) but instant from any region
-- Good for: analytics, dashboards, non-critical reads
SELECT * FROM orders AS OF SYSTEM TIME follower_read_timestamp()
WHERE user_id = 123;
Non-voting replicas: Standard Raft requires a majority of voting replicas to ACK each write. If the three voting replicas are spread across US-East, EU-West, and AP-Southeast, every commit must wait for at least one cross-region ACK (~85ms or more). Non-voting replicas participate in reads but NOT in the write quorum, so all voting replicas can live in US-East (local quorum, ~1ms commits) while EU-West and AP-Southeast hold non-voting replicas that asynchronously catch up and serve local reads.
GDPR & Data Residency
EU personal data must stay within the EU (absent an approved transfer mechanism) — five concrete architecture implications
1
DATA CLASSIFICATION
Identify what is "personal data" under GDPR before designing storage. Name, email, IP address, location, device IDs, behavioral data, and any data that can identify a person.
Personal: name, email, ip_address, location, user_id
Non-personal: aggregated counts, anonymized analytics
2
REGIONAL PARTITIONING + GEO-ROUTING
EU users' personal data must be stored only in EU region. DNS geolocation routing (Route 53) sends EU requests to EU region. JWT contains region claim — services route accordingly.
Route53 geolocation: *.api.com → eu-west-1 (for EU IPs)
JWT: {"user_id": 123, "region": "eu-west-1"}
Service: read/write user data only from claimed region
3
NO CROSS-REGION REPLICATION OF PERSONAL DATA
EU personal data must NOT be replicated to the US region — not even for backups. Use separate S3 buckets, separate KMS keys, and separate DB clusters per region.
EU cluster: eu-west-1 only, KMS key: eu-west-1
US cluster: us-east-1 only, KMS key: us-east-1
Backups: encrypted with regional KMS → stays in region
4
RIGHT TO ERASURE — CRYPTOGRAPHIC DELETION
User requests deletion. Must erase from ALL replicas, caches, CDN, and backups. Cryptographic erasure: encrypt all user data with a per-user key stored in KMS; deletion = delete the KMS key, and every copy becomes unreadable without having to rewrite a single blob (a code sketch follows this list).
Store: encrypt(user_data, kms_key_id=user_123)
Delete: KMS.deleteKey(user_123)
Result: all stored blobs are now unreadable garbage
5
THIRD-PARTY SERVICES (LOGS, MONITORING)
Datadog, Splunk, and other log aggregators that receive EU personal data must have Data Processing Agreements (DPAs) and EU data residency configured. Never log raw EU personal data — hash or pseudonymize before sending to third parties.
Bad: log("user login: email=alice@eu.com ip=1.2.3.4")
Good: log("user login: user_id=hash(alice) region=eu")
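A Python sketch of the cryptographic-erasure pattern from item 4. It uses the cryptography package's Fernet for the per-user data key and an in-memory dict as a stand-in for KMS, since the real key-management calls depend on the provider:

# Sketch: cryptographic erasure with a per-user key.
from cryptography.fernet import Fernet

user_keys = {}     # stand-in for per-user keys held in KMS
blob_store = {}    # stand-in for S3 / DB blobs, replicas, and backups

def store_user_data(user_id: str, payload: bytes):
    key = user_keys.setdefault(user_id, Fernet.generate_key())
    blob_store[user_id] = Fernet(key).encrypt(payload)

def read_user_data(user_id: str) -> bytes:
    return Fernet(user_keys[user_id]).decrypt(blob_store[user_id])

def erase_user(user_id: str):
    # Right to erasure: destroying the key makes every ciphertext copy
    # (replicas, caches, backups) unreadable without rewriting any of them.
    del user_keys[user_id]

store_user_data("user_123", b'{"email": "alice@example.eu"}')
print(read_user_data("user_123"))
erase_user("user_123")
# read_user_data("user_123") now fails: no key, no plaintext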
RPO & RTO Design
Define requirements first — then derive replication strategy
// RPO — RECOVERY POINT OBJECTIVE (how much data loss is acceptable?)
RPO = 0        → sync replication
RPO = seconds  → async replication
RPO = minutes  → periodic backups
// RTO — RECOVERY TIME OBJECTIVE (how fast must the system recover?)
RTO < 30s      → hot standby + auto failover
RTO = minutes  → warm standby + semi-auto failover
RTO = hours    → cold backup + manual restore
Interview formula: SLA → RTO/RPO → replication strategy · DECISION LOGIC
// Given: 99.99% availability SLA (52 min/year downtime budget)
// Incident: detect=15min, page team=5min, fix=30min → 50 min per incident
// → A single manual-response incident (~50 min) nearly exhausts the 52 min/year budget
// → Need automatic failover with RTO < 30s

// Payment service (financial data):
RPO = 0      → synchronous replication  (zero data loss)
RTO = 30s    → hot standby, auto-promote (no manual steps)
Cost:         higher write latency, expensive hot standby

// Analytics service (aggregate counts):
RPO = minutes → periodic snapshots         (some loss OK)
RTO = hours   → restore from snapshot      (downtime OK)
Cost:         cheap backup storage, no standby infra

// Product catalog (semi-static data):
RPO = seconds → async replication           (small loss OK)
RTO = minutes → warm standby, semi-auto    (brief outage OK)
Cost:         moderate — warm standby only
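The SLA arithmetic above is easy to script. A small Python sketch of the first step, turning an availability target into a yearly downtime budget; the incident figures are the example numbers from the block above:

# Sketch: SLA → yearly downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_budget_min(sla: float) -> float:
    return MINUTES_PER_YEAR * (1 - sla)

print(round(downtime_budget_min(0.9999), 1))   # ~52.6 min/year
print(round(downtime_budget_min(0.999), 1))    # ~525.6 min/year

incident_min = 15 + 5 + 30                     # detect + page + fix = 50 min
print(int(downtime_budget_min(0.9999) // incident_min))   # at most 1 manual incident/year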
1
Latency Math Across Region Topologies
~1 hr
  1. Synchronous global Raft cluster, all replicas voting: US-East (leader), EU-West, AP-Southeast. Client writes from US-East. What is the minimum write latency? Show the RTT math.
  2. Same cluster but AP-Southeast is non-voting. Now what is the write latency? What does AP-Southeast contribute?
  3. Add SA-East (~90ms from US-East) as a 4th non-voting replica. Does write latency change? Does read latency for SA-East users change?
  4. Tokyo user on active-passive system (US-East is primary, AP-Southeast has read replica). RTT for a write? RTT for a read? What SLA is achievable for Tokyo reads vs writes?
2
PN-Counter CRDT Implementation
~1.5 hrs
  1. Implement PNCounter(nodeId, numNodes) with increment(), decrement(), merge(other), and value()
  2. Test: Node A increments 3×, Node B increments 5× and decrements 2×. Merge A→B and B→A. Verify both show value=6.
  3. Test idempotency: merge A into B twice. Does the result change?
  4. Test commutativity: merge(A,B) == merge(B,A)?
  5. Why can a regular integer counter NOT be a CRDT? Which of the three properties (commutative, associative, idempotent) does it violate?
3
GDPR Architecture for Twitter Feed (B6)
~2 hrs
  1. List all personal data fields in the B6 Twitter design. Which must be GDPR-protected?
  2. Design regional partitioning: can an EU user's tweet be cached in a US Redis instance? Can it be in a US CDN edge node?
  3. User exercises right to erasure. Step-by-step deletion from: Cassandra (with TTR replicas), Redis timelines, ClickHouse analytics, Kafka topic logs.
  4. Design cryptographic erasure: what is the KMS key structure? Who manages keys? How long does deletion take to propagate?
  5. EU user's timeline includes tweets from a US user. Is including those tweets in EU storage a GDPR violation?
4
Multi-Region E-Commerce (India + EU)
~3 hrs

50M India users, 20M EU users. GDPR applies to EU. Products in both India and EU warehouses.

  1. Product catalog (read-heavy, non-personal): which replication strategy? Active-active LWW? Read replicas? GLOBAL locality? Justify.
  2. Inventory (globally shared — 1 unit in Bangalore can be bought by India or EU user): how do you prevent overselling without a global lock?
  3. Orders: EU orders must stay in EU (GDPR). India user buys from EU warehouse — where does the order record live?
  4. RPO/RTO for each service: inventory RPO=0/RTO=30s, catalog RPO=minutes/RTO=hours, orders RPO=0/RTO=60s. Design the specific replication for each.
  5. Right to erasure for an EU user who has orders, reviews, and browsing history. What gets deleted? What can be retained (anonymized)?
MODULE C2 · GEO-DISTRIBUTION
Three reasons for multi-region: availability, latency, compliance
RTT numbers: US↔EU ~85ms, US↔APAC ~140ms, cross-AZ ~1ms
Sync global Raft → write latency = furthest region RTT (85–140ms)
Non-voting replicas: local write quorum + async remote replication
Active-passive: RPO (sync=0, async=seconds), RTO (auto=30s, manual=minutes)
Active-active conflict strategies: LWW, vector clocks, CRDTs, OT
CRDT properties: commutative + associative + idempotent merge
G-Counter: vector per node, merge=max per slot, value=sum
PN-Counter: two G-Counters, value = P - N
G-Set, 2P-Set, OR-Set — merge rules and use cases
Sequence CRDTs (RGA/LSEQ) for collaborative text editing
DynamoDB Global Tables: LWW active-active, ~1s lag, bad for counters
CockroachDB: REGIONAL BY ROW, GLOBAL, non-voting replicas, follower reads
GDPR: data classification, regional partitioning, geo-routing
GDPR right to erasure: cryptographic erasure via KMS key deletion
GDPR third parties: hash/pseudonymize before sending to log aggregators
RPO/RTO formula: SLA → downtime budget → RTO → replication strategy
✏️ Task 1: latency math across region topologies
✏️ Task 2: PN-Counter CRDT implementation + property verification
✏️ Task 4 (capstone): multi-region e-commerce India + EU
// NEXT MODULE
C3 — ML Systems Design
Feature stores · Training pipelines · Model serving infrastructure
A/B testing at scale · Shadow mode deployment · Feedback loops
Real-time vs batch inference · Model versioning · Embeddings at scale