SYSTEM DESIGN MASTERY · TRACK C · MODULE C2 · WEEK 26 · GEO-DISTRIBUTION · ACTIVE-ACTIVE · CRDTS · GDPR · RPO/RTO
Advanced Distributed Systems · Multi-Region · Data Residency

Geo‑Distribution & Multi‑Region Architecture

ACTIVE-PASSIVE · ACTIVE-ACTIVE · CRDTs
DYNAMODB GLOBAL · COCKROACHDB · GDPR · RPO/RTO
~85ms · US ↔ EU RTT
6 · CRDT TYPES
RPO=0 · SYNC REPLICATION
MODULE C2
Speed of Light
Active-Passive
Active-Active
CRDTs
GDPR
DynamoDB Global
CockroachDB
RPO / RTO
Why Go Multi-Region?
Three distinct motivations — each requires a different solution
Availability
SURVIVE REGION OUTAGE
AWS us-east-1 has gone down multiple times (2017, 2021, 2023). A single-region system goes down with it.
Solution:
Active-passive failover
Active-active with replication
Latency
SERVE NEARBY USERS
Tokyo user hitting us-east-1: ~140ms. Tokyo user hitting ap-northeast-1: ~5ms. For writes: 28× difference.
Solution:
Read replicas in each region
Regional active-active writes
Compliance
DATA RESIDENCY LAWS
GDPR (EU), India's DPDP Act, and China's PIPL restrict where personal data may be stored and how it may cross borders. Violations carry large fines.
Solution:
Regional data partitioning
Geo-routing + KMS per region
Interview clarifying question: "Before I design the multi-region architecture — which of these are we solving? Availability requires hot standby and automatic failover. Latency requires regional replicas or active-active. Compliance requires strict data partitioning and geo-routing. They have very different solutions and cost profiles."
Speed of Light
The unavoidable physical constraint — these numbers drive every multi-region decision
// REAL-WORLD ROUND-TRIP TIMES (approximate)
Within AZ                ~0.5ms      baseline
Cross-AZ                 ~1–2ms      fast
US-East ↔ US-West        ~70ms       noticeable
US-East ↔ EU-West        ~85ms       slow
US-East ↔ APAC           ~140ms      very slow
EU-West ↔ APAC           ~130ms      very slow
EU-West ↔ EU-Central     ~20ms       ok
Within APAC              ~50–80ms    varies
Latency math: why synchronous global consensus is impractical · CALCULATION
// Raft cluster: US-East (leader), EU-West, AP-Southeast
// Quorum = 2 of 3. Client writes from US-East.

Write latency = time until quorum ACK
  US-East → EU-West round trip:  ~85ms
  US-East → AP-Southeast RT:     ~140ms

Quorum write latency = min(EU-West RTT, AP-Southeast RTT) = ~85ms
// must wait for 1 of 2 remote nodes to ACK

// This means every write to your database takes 85ms minimum.
// A 10ms SLA is IMPOSSIBLE with synchronous global Raft.

// Solution: non-voting replicas (CockroachDB style)
// US-East: 2 voting replicas (quorum = 2, all local → ~1ms write latency)
// EU-West: 1 non-voting replica (async replication, ~1s lag)
// AP-Southeast: 1 non-voting replica (async replication, ~1s lag)
// Write latency: ~1ms (local quorum) ✓
// EU/APAC reads: served locally (slightly stale) ✓
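The same arithmetic can be packaged as a small Python sketch (and reused for Task 1). The region names and RTT figures below are the approximate numbers from the table above, not measurements:

# Sketch: Raft commit latency for a given placement of voting replicas.
RTT_MS = {
    ("us-east", "us-east"): 1,          # replica in the same region
    ("us-east", "eu-west"): 85,
    ("us-east", "ap-southeast"): 140,
}

def commit_latency_ms(leader_region, other_voter_regions):
    # Raft commit = leader + floor(n/2) ACKs; the leader's own vote is free,
    # so it waits for the fastest (quorum - 1) ACKs among the other voters.
    n = 1 + len(other_voter_regions)
    acks_needed = (n // 2 + 1) - 1
    rtts = sorted(RTT_MS[(leader_region, r)] for r in other_voter_regions)
    return rtts[acks_needed - 1] if acks_needed else 0

# All three regions voting: every commit waits for the nearest remote region
print(commit_latency_ms("us-east", ["eu-west", "ap-southeast"]))   # 85
# Voting replicas kept local, remote regions non-voting: commits stay local
print(commit_latency_ms("us-east", ["us-east", "us-east"]))        # 1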
Active-Passive vs Active-Active
The fundamental choice — consistency vs availability under partition
Active-Passive
ONE PRIMARY · N STANDBYS
Primary serves all reads and writes. Standby(s) replicate from primary but serve nothing normally. On primary failure: standby is promoted.
Sync replication: primary waits for standby ACK before committing. RPO=0, but adds cross-region latency to every write.
Async replication: primary commits immediately, replication in background. RPO=seconds, no write latency penalty. Standard choice.
✓ Simple to reason about
✓ No conflict resolution needed
✓ Strong consistency always
✗ All writes go to one region
✗ Remote users pay write latency
✗ Failover gap: 30s–minutes
✗ Risk: split-brain on failover
Active-Active
ALL REGIONS READ + WRITE
Multiple regions all serve reads AND writes. No primary. All regions are peers. Writes in different regions must converge.
Conflict resolution required: Last-Write-Wins (LWW), vector clocks, CRDTs, or Operational Transform. Each has different trade-offs.
Eventually consistent: after a write in EU, the US region may be ~1s behind. Briefly inconsistent across regions.
✓ Low write latency everywhere
✓ Survives full region loss
✓ No bottleneck primary region
✗ Conflict resolution complexity
✗ Eventually consistent
✗ Some conflicts = silent data loss (LWW)
✗ Harder to debug
CONFLICT STRATEGY | HOW IT WORKS | DATA LOSS? | BEST FOR
Last-Write-Wins | Highest timestamp wins on conflict | Yes — concurrent write silently lost | Sessions, caches, preferences
Vector Clocks | Track causality; surface conflicts to app | No — app resolves explicitly | Shopping carts (Dynamo original)
CRDTs | Mathematically conflict-free merge | Never — by design | Counters, sets, collaborative editing
Operational Transform | Transform ops against concurrent ops | Never — but complex | Real-time collaborative text (Google Docs)
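Before moving to CRDTs, a tiny Python sketch of how LWW silently drops a concurrent write; the (value, timestamp) register and the timestamps are illustrative, not any particular database's storage format:

# Sketch: why Last-Write-Wins loses a concurrent write.
def lww_merge(a, b):                      # each write is (value, timestamp)
    return a if a[1] >= b[1] else b

# Both regions read balance = 0, then apply deposits concurrently.
us_write = (0 + 100, 1700000000.001)      # US client deposits $100
eu_write = (0 + 50,  1700000000.002)      # EU client deposits $50, 1 ms later

print(lww_merge(us_write, eu_write))      # (50, ...) → the $100 deposit is gone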
CRDTs — Conflict-Free Replicated Data Types
Mathematical guarantee: any two replicas merged in any order → same result
The three properties: merge must be Commutative (order doesn't matter), Associative (grouping doesn't matter), Idempotent (merging twice = merging once). Any data structure satisfying these three is a CRDT — concurrent updates across replicas always converge.
G-Counter
GROW-ONLY COUNTER
Vector of counts, one slot per node. Node i only increments slot i. Merge = max of each slot. Value = sum all slots.
State: [3, 5, 2]
Merge: max per slot
Value: sum = 10
view counts, like counts, event totals
PN-Counter
INCREMENT + DECREMENT
Two G-Counters: P (increments) and N (decrements). Value = sum(P) - sum(N). Supports decrements while remaining a CRDT.
P: [3,5,2] N: [1,2,0]
Value = 10 - 3 = 7
shopping cart quantities, inventory approximation
G-Set
GROW-ONLY SET
Elements can only be added, never removed. Merge = union of both sets. Trivially conflict-free.
A: {x,y} B: {y,z}
Merge: {x,y,z}
tag sets, immutable membership lists
2P-Set
ADD + PERMANENT REMOVE
Two G-Sets: Add-set A and Remove-set R. Element present if: in A AND NOT in R. Once removed, cannot re-add.
Present = A \ R
Remove is permanent
when re-add not needed: banned users, archived items
OR-Set
OBSERVED-REMOVE SET
Each add tagged with unique ID. Remove only removes elements with that specific tag. Allows correct add-remove-add cycles.
add(x, id1) → remove(x, id1)
add(x, id2) → x still present
collaborative editing, presence systems
Sequence CRDT
RGA / LSEQ (TEXT EDITING)
Positions in sequence assigned unique IDs. Insert/delete at ID works regardless of concurrent operations on same text. Used in collaborative editors.
Insert at pos-ID
Concurrent inserts merge correctly
Apple Notes, Figma, Notion, Google Docs (OT variant)
G-Counter CRDT — concrete implementation · PYTHON
class GCounter:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.counts = [0] * num_nodes   # slot per node

    def increment(self):
        self.counts[self.node_id] += 1  # only increment OWN slot

    def value(self):
        return sum(self.counts)

    def merge(self, other):             # merge another replica
        self.counts = [max(a, b)
            for a, b in zip(self.counts, other.counts)]

# Simulation: 3 nodes
A = GCounter(0, 3); A.increment(); A.increment(); A.increment()  # [3,0,0]
B = GCounter(1, 3); B.increment(); B.increment()                 # [0,2,0]
C = GCounter(2, 3); C.increment()                                # [0,0,1]

A.merge(B); A.merge(C)  # [3,2,1] → value = 6
B.merge(A); B.merge(C)  # [3,2,1] → value = 6
# All replicas converge to 6 regardless of merge order ✓
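The same merge discipline extends to sets. A compact OR-Set sketch along the lines of the card above; the tagging scheme (one random UUID per add) is an illustrative choice:

# Sketch: OR-Set, add-tags plus observed-remove tombstones.
import uuid

class ORSet:
    def __init__(self):
        self.adds = {}       # element -> set of unique add-tags
        self.removed = {}    # element -> set of tombstoned tags

    def add(self, elem):
        self.adds.setdefault(elem, set()).add(uuid.uuid4().hex)

    def remove(self, elem):
        # tombstone only the tags this replica has observed so far
        self.removed.setdefault(elem, set()).update(self.adds.get(elem, set()))

    def contains(self, elem):
        return bool(self.adds.get(elem, set()) - self.removed.get(elem, set()))

    def merge(self, other):              # union both maps, slot by slot
        for e, tags in other.adds.items():
            self.adds.setdefault(e, set()).update(tags)
        for e, tags in other.removed.items():
            self.removed.setdefault(e, set()).update(tags)

# Concurrent remove (on A) and re-add (on B) converge with x still present
A, B = ORSet(), ORSet()
A.add("x"); B.merge(A)                   # both replicas observe the first add
A.remove("x")                            # A removes the tag it observed
B.add("x")                               # B concurrently re-adds with a fresh tag
A.merge(B); B.merge(A)
print(A.contains("x"), B.contains("x"))  # True True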
DynamoDB Global Tables
AWS managed active-active — LWW conflict resolution, ~1s replication lag
DynamoDB Global Tables — what it does and doesn't handle · AWS
// Architecture: table replicated across N AWS regions
// Each region: accepts reads AND writes independently
// Replication: asynchronous, bidirectional, ~1s lag
// Conflict resolution: LAST-WRITE-WINS (timestamp-based)

// GOOD USES (LWW acceptable):
✓ User sessions: {user_id → session_token, last_seen}
✓ User preferences: {user_id → theme, language, notifications}
✓ Shopping cart: {user_id → cart_items} (last write wins per user)
✓ Configuration flags: {feature_flag → enabled/disabled}

// BAD USES (LWW causes data loss):
✗ Financial balances: concurrent increments → one lost
   User in US adds $100, user in EU adds $50 simultaneously
   One write wins → balance shows +$100 OR +$50, not +$150
✗ Inventory: concurrent decrements → overselling
✗ Leaderboard rankings: concurrent score updates

// For counters: use DynamoDB Streams + Lambda to consolidate
// Or: use a CRDT service (PN-Counter semantics)
// Or: route all writes for a given key to its "home" region

// Cost: every replica region stores a full copy and is billed for replicated
// writes, so N regions ≈ N× storage + write-throughput cost
// Replication lag: typically ~1s, can spike to ~5s under high load
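One way to keep counters safe on an LWW store, per the last comment in the block above, is to give every key a single "home" region so no two regions ever write the same item concurrently. A minimal Python routing sketch; the region list, the hash-based assignment, and send_to are assumptions for illustration, not a specific AWS API:

# Sketch: "home region" write routing to avoid concurrent cross-region writes.
import hashlib

REGIONS = ["us-east-1", "eu-west-1", "ap-northeast-1"]

def home_region(key: str) -> str:
    digest = hashlib.sha256(key.encode()).hexdigest()
    return REGIONS[int(digest, 16) % len(REGIONS)]

def write(key: str, update: dict) -> str:
    region = home_region(key)
    # send_to(region, key, update)   # placeholder for the per-region client call
    return region

# Every increment to this balance lands in one region, so LWW replication
# never has to choose between two concurrent versions of the item.
print(write("account:123", {"op": "add_balance", "amount": 100}))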
CockroachDB Multi-Region
Designed for multi-region from the ground up — regional row placement, non-voting replicas
Table locality modes — the key CockroachDB multi-region concept · COCKROACHDB SQL
-- Set survival goal: lose an entire region without data loss
ALTER DATABASE mydb SURVIVE REGION FAILURE;

-- REGIONAL BY ROW: each row pinned to its home region
-- EU user's rows stored in EU region → EU reads/writes at 5ms latency
ALTER TABLE users SET LOCALITY REGIONAL BY ROW;  -- uses the hidden crdb_region column

-- User in EU: crdb_region='eu-west-1' → row stored in EU → fast local access
INSERT INTO users (id, name, crdb_region) VALUES (1, 'Alice', 'eu-west-1');

-- REGIONAL TABLE: entire table anchored to one region
-- Good for: tables only accessed from one region
ALTER TABLE eu_compliance_log SET LOCALITY REGIONAL IN 'eu-west-1';

-- GLOBAL: replicated everywhere, optimized for reads everywhere
-- Writes need global consensus (slower), reads are local (fast)
-- Good for: product catalog, reference data, configuration
ALTER TABLE product_catalog SET LOCALITY GLOBAL;

-- Follower reads: slightly stale (~4.8s) but instant from any region
-- Good for: analytics, dashboards, non-critical reads
SELECT * FROM orders AS OF SYSTEM TIME follower_read_timestamp()
WHERE user_id = 123;
Non-voting replicas: Standard Raft requires a majority of voting replicas to ACK each write. If the three voting replicas are spread across US-East, EU-West, and AP-Southeast, every commit must wait for at least one cross-region ACK (~85ms or more). Non-voting replicas participate in reads but NOT in the write quorum, so all voting replicas can live in US-East (local quorum, ~1ms commits) while EU-West and AP-Southeast hold non-voting replicas that asynchronously catch up and serve local reads.
GDPR & Data Residency
EU personal data must stay within the EU (absent an approved transfer mechanism) — five concrete architecture implications
1
DATA CLASSIFICATION
Identify what is "personal data" under GDPR before designing storage. Name, email, IP address, location, device IDs, behavioral data, and any data that can identify a person.
Personal: name, email, ip_address, location, user_id
Non-personal: aggregated counts, anonymized analytics
2
REGIONAL PARTITIONING + GEO-ROUTING
EU users' personal data must be stored only in EU region. DNS geolocation routing (Route 53) sends EU requests to EU region. JWT contains region claim — services route accordingly.
Route53 geolocation: *.api.com → eu-west-1 (for EU IPs)
JWT: {"user_id": 123, "region": "eu-west-1"}
Service: read/write user data only from claimed region
3
NO CROSS-REGION REPLICATION OF PERSONAL DATA
EU personal data must NOT be replicated to the US region — not even for backups. Use separate S3 buckets, separate KMS keys, and separate DB clusters per region.
EU cluster: eu-west-1 only, KMS key: eu-west-1
US cluster: us-east-1 only, KMS key: us-east-1
Backups: encrypted with regional KMS → stays in region
4
RIGHT TO ERASURE — CRYPTOGRAPHIC DELETION
User requests deletion. Must erase from ALL replicas, caches, CDN, and backups. Cryptographic erasure: encrypt all user data with a per-user key stored in KMS; deletion = delete the KMS key, and every copy becomes unreadable without having to rewrite a single blob (a code sketch follows this list).
Store: encrypt(user_data, kms_key_id=user_123)
Delete: KMS.deleteKey(user_123)
Result: all stored blobs are now unreadable garbage
5
THIRD-PARTY SERVICES (LOGS, MONITORING)
Datadog, Splunk, and other log aggregators that receive EU personal data must have Data Processing Agreements (DPAs) and EU data residency configured. Never log raw EU personal data — hash or pseudonymize before sending to third parties.
Bad: log("user login: email=alice@eu.com ip=1.2.3.4")
Good: log("user login: user_id=hash(alice) region=eu")
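A Python sketch of the cryptographic-erasure pattern from item 4. It uses the cryptography package's Fernet for the per-user data key and an in-memory dict as a stand-in for KMS, since the real key-management calls depend on the provider:

# Sketch: cryptographic erasure with a per-user key.
from cryptography.fernet import Fernet

user_keys = {}     # stand-in for per-user keys held in KMS
blob_store = {}    # stand-in for S3 / DB blobs, replicas, and backups

def store_user_data(user_id: str, payload: bytes):
    key = user_keys.setdefault(user_id, Fernet.generate_key())
    blob_store[user_id] = Fernet(key).encrypt(payload)

def read_user_data(user_id: str) -> bytes:
    return Fernet(user_keys[user_id]).decrypt(blob_store[user_id])

def erase_user(user_id: str):
    # Right to erasure: destroying the key makes every ciphertext copy
    # (replicas, caches, backups) unreadable without rewriting any of them.
    del user_keys[user_id]

store_user_data("user_123", b'{"email": "alice@example.eu"}')
print(read_user_data("user_123"))
erase_user("user_123")
# read_user_data("user_123") now fails: no key, no plaintext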
RPO & RTO Design
Define requirements first — then derive replication strategy
// RPO — RECOVERY POINT OBJECTIVE (how much data loss is acceptable?)
RPO = 0        → sync replication
RPO = seconds  → async replication
RPO = minutes  → periodic backups
// RTO — RECOVERY TIME OBJECTIVE (how fast must the system recover?)
RTO < 30s      → hot standby + auto failover
RTO = minutes  → warm standby + semi-auto failover
RTO = hours    → cold backup + manual restore
Interview formula: SLA → RTO/RPO → replication strategy · DECISION LOGIC
// Given: 99.99% availability SLA (52 min/year downtime budget)
// Incident: detect=15min, page team=5min, fix=30min → 50 min per incident
// → A single manual-response incident (~50 min) nearly exhausts the 52 min/year budget
// → Need automatic failover with RTO < 30s

// Payment service (financial data):
RPO = 0      → synchronous replication  (zero data loss)
RTO = 30s    → hot standby, auto-promote (no manual steps)
Cost:         higher write latency, expensive hot standby

// Analytics service (aggregate counts):
RPO = minutes → periodic snapshots         (some loss OK)
RTO = hours   → restore from snapshot      (downtime OK)
Cost:         cheap backup storage, no standby infra

// Product catalog (semi-static data):
RPO = seconds → async replication           (small loss OK)
RTO = minutes → warm standby, semi-auto    (brief outage OK)
Cost:         moderate — warm standby only
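The SLA arithmetic above is easy to script. A small Python sketch of the first step, turning an availability target into a yearly downtime budget; the incident figures are the example numbers from the block above:

# Sketch: SLA → yearly downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_budget_min(sla: float) -> float:
    return MINUTES_PER_YEAR * (1 - sla)

print(round(downtime_budget_min(0.9999), 1))   # ~52.6 min/year
print(round(downtime_budget_min(0.999), 1))    # ~525.6 min/year

incident_min = 15 + 5 + 30                     # detect + page + fix = 50 min
print(int(downtime_budget_min(0.9999) // incident_min))   # at most 1 manual incident/year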
1
Latency Math Across Region Topologies
~1 hr
  1. Synchronous global Raft cluster, all replicas voting: US-East (leader), EU-West, AP-Southeast. Client writes from US-East. What is the minimum write latency? Show the RTT math.
  2. Same cluster but AP-Southeast is non-voting. Now what is the write latency? What does AP-Southeast contribute?
  3. Add SA-East (~90ms from US-East) as a 4th non-voting replica. Does write latency change? Does read latency for SA-East users change?
  4. Tokyo user on active-passive system (US-East is primary, AP-Southeast has read replica). RTT for a write? RTT for a read? What SLA is achievable for Tokyo reads vs writes?
2
PN-Counter CRDT Implementation
~1.5 hrs
  1. Implement PNCounter(nodeId, numNodes) with increment(), decrement(), merge(other), and value()
  2. Test: Node A increments 3×, Node B increments 5× and decrements 2×. Merge A→B and B→A. Verify both show value=6.
  3. Test idempotency: merge A into B twice. Does the result change?
  4. Test commutativity: merge(A,B) == merge(B,A)?
  5. Why can a regular integer counter NOT be a CRDT? Which of the three properties (commutative, associative, idempotent) does it violate?
3
GDPR Architecture for Twitter Feed (B6)
~2 hrs
  1. List all personal data fields in the B6 Twitter design. Which must be GDPR-protected?
  2. Design regional partitioning: can an EU user's tweet be cached in a US Redis instance? Can it be in a US CDN edge node?
  3. User exercises right to erasure. Step-by-step deletion from: Cassandra (with TTR replicas), Redis timelines, ClickHouse analytics, Kafka topic logs.
  4. Design cryptographic erasure: what is the KMS key structure? Who manages keys? How long does deletion take to propagate?
  5. EU user's timeline includes tweets from a US user. Is including those tweets in EU storage a GDPR violation?
4
Multi-Region E-Commerce (India + EU)
~3 hrs

50M India users, 20M EU users. GDPR applies to EU. Products in both India and EU warehouses.

  1. Product catalog (read-heavy, non-personal): which replication strategy? Active-active LWW? Read replicas? GLOBAL locality? Justify.
  2. Inventory (globally shared — 1 unit in Bangalore can be bought by India or EU user): how do you prevent overselling without a global lock?
  3. Orders: EU orders must stay in EU (GDPR). India user buys from EU warehouse — where does the order record live?
  4. RPO/RTO for each service: inventory RPO=0/RTO=30s, catalog RPO=minutes/RTO=hours, orders RPO=0/RTO=60s. Design the specific replication for each.
  5. Right to erasure for an EU user who has orders, reviews, and browsing history. What gets deleted? What can be retained (anonymized)?
MODULE C2 · GEO-DISTRIBUTION
Three reasons for multi-region: availability, latency, compliance
RTT numbers: US↔EU ~85ms, US↔APAC ~140ms, cross-AZ ~1ms
Sync global Raft → write latency = furthest region RTT (85–140ms)
Non-voting replicas: local write quorum + async remote replication
Active-passive: RPO (sync=0, async=seconds), RTO (auto=30s, manual=minutes)
Active-active conflict strategies: LWW, vector clocks, CRDTs, OT
CRDT properties: commutative + associative + idempotent merge
G-Counter: vector per node, merge=max per slot, value=sum
PN-Counter: two G-Counters, value = P - N
G-Set, 2P-Set, OR-Set — merge rules and use cases
Sequence CRDTs (RGA/LSEQ) for collaborative text editing
DynamoDB Global Tables: LWW active-active, ~1s lag, bad for counters
CockroachDB: REGIONAL BY ROW, GLOBAL, non-voting replicas, follower reads
GDPR: data classification, regional partitioning, geo-routing
GDPR right to erasure: cryptographic erasure via KMS key deletion
GDPR third parties: hash/pseudonymize before sending to log aggregators
RPO/RTO formula: SLA → downtime budget → RTO → replication strategy
✏️ Task 1: latency math across region topologies
✏️ Task 2: PN-Counter CRDT implementation + property verification
✏️ Task 4 (capstone): multi-region e-commerce India + EU
// NEXT MODULE
C3 — ML Systems Design
Feature stores · Training pipelines · Model serving infrastructure
A/B testing at scale · Shadow mode deployment · Feedback loops
Real-time vs batch inference · Model versioning · Embeddings at scale