Track B · HLD · Module B2 · Week 12

Databases
at Scale

Indexing · ACID vs BASE · SQL vs NoSQL
Replication · Sharding · Read Replicas · Selection Guide
7 TOPICS · 4 TASKS · 5 NOSQL MODELS · MODULE B2
B-Tree Index
ACID / BASE
SQL vs NoSQL
Replication
Sharding
Hotkey
DB Selection
Indexing
Why queries go from O(n) to O(log n) — and what it costs
B-Tree Index
Balanced search tree of (key → row pointer) pairs. Keys sorted → enables range queries, ORDER BY, binary search in O(log n).
USE: equality (=), range (<, >, BETWEEN), ORDER BY, JOIN ON
DEFAULT: PostgreSQL, MySQL, Oracle
Hash Index
Hash map of key → row pointer. O(1) exact lookup. Keys unordered — range queries impossible.
USE: equality only (=). Never for ranges.
EXAMPLES: Memcached, Redis, MySQL MEMORY engine
Composite Index
Index on (col_A, col_B). Left-most prefix rule: queries must start with col_A to use the index. Most selective column goes first.
INDEX(user_id, created_at)
✓ WHERE user_id=5
✓ WHERE user_id=5 AND created_at>X
✗ WHERE created_at>X (skips prefix)
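The left-most prefix rule is easy to verify against SQLite's query planner (a minimal sketch; table and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY, user_id INT, created_at INT, total REAL)""")
conn.execute("CREATE INDEX idx_user_time ON orders(user_id, created_at)")

def plan(query: str) -> str:
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return " ".join(row[3] for row in
                    conn.execute("EXPLAIN QUERY PLAN " + query))

p1 = plan("SELECT * FROM orders WHERE user_id = 5")
p2 = plan("SELECT * FROM orders WHERE user_id = 5 AND created_at > 100")
p3 = plan("SELECT * FROM orders WHERE created_at > 100")

print(p1)  # SEARCH ... USING INDEX idx_user_time (user_id=?)
print(p2)  # SEARCH ... USING INDEX idx_user_time (user_id=? AND created_at>?)
print(p3)  # SCAN ... — prefix column skipped, index unusable
```

The first two queries start with `user_id`, so the planner searches the B-tree; the third skips the prefix and falls back to a full scan.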
Covering Index — query satisfied entirely from the index
-- Table: orders (10M rows)
-- Query: SELECT user_id, created_at FROM orders WHERE user_id = 5

-- Regular index on user_id:
--   1. B-tree lookup → row pointer
--   2. Fetch row from heap (random disk I/O) to read created_at

-- Covering index on (user_id, created_at):
CREATE INDEX idx_covering ON orders(user_id, created_at);
--   1. B-tree lookup → both columns found IN the index leaf
--   2. NO heap access at all → drastically reduced I/O
--   ✅ PostgreSQL calls this "Index Only Scan"

-- Index design for common queries:
CREATE INDEX idx_user   ON orders(user_id);                  -- Q1: orders by user
CREATE INDEX idx_rest_t ON orders(restaurant_id, created_at); -- Q2: restaurant + time range
CREATE INDEX idx_status ON orders(status) WHERE status = 'pending'; -- Q3: partial index!
CREATE INDEX idx_user_d ON orders(user_id, created_at);      -- Q4: user + date range
Index cost trade-off: Every write (INSERT/UPDATE/DELETE) must update all indexes on that table. A table with 10 indexes needs 10 B-tree updates per write. Index only columns used in WHERE, JOIN ON, ORDER BY. Low-cardinality columns (boolean, status with 3 values) often give poor selectivity — the optimizer may prefer a full scan.
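For a low-cardinality column like `status`, a partial index sidesteps the selectivity problem: only the rows you actually query enter the B-tree. A sketch in SQLite (names illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL)")
# Partial index: only 'pending' rows enter the B-tree — a tiny index
# that is cheap to maintain, even though status has few distinct values
conn.execute(
    "CREATE INDEX idx_pending ON orders(status) WHERE status = 'pending'")

rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE status = 'pending'")
detail = " ".join(r[3] for r in rows)
print(detail)  # the planner picks idx_pending for the dashboard query
```

The planner can use the partial index because the query's WHERE clause implies the index's WHERE clause.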
ACID vs BASE
The consistency guarantee spectrum — choose based on use case
ACID
RELATIONAL DATABASES · STRONG GUARANTEES
A
Atomicity — Transaction is all-or-nothing. Transfer $100: if debit succeeds but credit fails, both are rolled back. No partial writes survive.
C
Consistency — DB transitions from one valid state to another. Constraints (FK, UNIQUE, NOT NULL, CHECK) always satisfied post-transaction.
I
Isolation — Concurrent transactions don't see each other's in-progress writes. Levels: READ COMMITTED (default PG) → REPEATABLE READ → SERIALIZABLE.
D
Durability — Committed transactions survive crashes. Guaranteed via WAL (write-ahead log) + fsync to disk before confirming commit.
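A minimal atomicity demo with Python's sqlite3 (account names and the CHECK constraint are illustrative): the credit succeeds, the debit violates the constraint, and the whole transaction rolls back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts "
             "(name TEXT PRIMARY KEY, balance INT CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 50), ("bob", 0)])
conn.commit()

try:
    with conn:  # one transaction: commit on success, rollback on error
        conn.execute(  # credit bob — this statement succeeds
            "UPDATE accounts SET balance = balance + 100 WHERE name = 'bob'")
        conn.execute(  # debit alice — 50 - 100 violates CHECK → error
            "UPDATE accounts SET balance = balance - 100 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    pass  # transaction rolled back; bob's credit is undone too

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 50, 'bob': 0} — no partial write survived
```

Even though bob's credit executed without error, it does not survive: the transaction is all-or-nothing.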
BASE
NOSQL / DISTRIBUTED · AVAILABLE BY DESIGN
B
Basically Available — System always responds, even if response is partial or stale. Rejects requests only in extreme failure, not during replication lag.
S
Soft State — System state can change over time even without new inputs. Replicas catching up, caches expiring, tombstones propagating.
E
Eventually Consistent — Given no new updates, all replicas converge to the same state. No guarantee on when. Reads may return stale data.
BASE is not "broken ACID" — it's a deliberate design choice.
Cassandra, DynamoDB, CouchDB are BASE by design.
Trade: strong consistency → availability + throughput
Isolation Levels Deep-Dive
ISOLATION LEVEL  | DIRTY READ | NON-REPEATABLE | PHANTOM READ | DEFAULT IN
READ UNCOMMITTED | Allowed    | Allowed        | Allowed      | —
READ COMMITTED   | Prevented  | Allowed        | Allowed      | PostgreSQL
REPEATABLE READ  | Prevented  | Prevented      | Allowed      | MySQL InnoDB
SERIALIZABLE     | Prevented  | Prevented      | Prevented    | Highest isolation
Interview pattern: Higher isolation level = fewer anomalies but more locking = lower throughput. Most applications run at READ COMMITTED. SERIALIZABLE is used for financial systems where phantom reads would cause incorrect calculations (e.g., calculating remaining inventory before inserting an order).
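The anomalies can be made concrete with a toy model (plain Python, not a real database — a per-transaction write buffer plus a committed store):

```python
# Toy model, NOT a real DB: one committed store plus a per-transaction
# write buffer, just to make each anomaly concrete.
committed = {"x": 10}
txn_a_writes = {"x": 99}  # txn A wrote x=99 but has NOT committed

# Dirty read (only READ UNCOMMITTED allows it): txn B sees A's
# uncommitted buffer — a value that may never actually commit.
dirty_read = txn_a_writes.get("x", committed["x"])   # 99

# READ COMMITTED: txn B only ever reads committed state.
committed_read = committed["x"]                      # 10

# Non-repeatable read (still allowed at READ COMMITTED): another
# txn commits between two reads inside the same transaction.
first_read = committed["x"]                          # 10
committed["x"] = 20                                  # other txn commits x=20
second_read = committed["x"]                         # 20 — changed mid-txn!

# REPEATABLE READ fixes this by reading a snapshot taken at txn start.
snapshot = {"x": first_read}
repeatable = snapshot["x"]                           # still 10

print(dirty_read, committed_read, first_read, second_read, repeatable)
```

Real engines implement the snapshot with MVCC rather than a copied dict, but the visibility rules are the same.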
SQL vs NoSQL
Not a religion — a trade-off based on access patterns and consistency needs
DIMENSION      | SQL (Relational)                                    | NoSQL (Distributed)
Data model     | Tables, rows, strict schema, FK constraints         | Documents, KV, wide-column, graph — flexible
Query language | SQL — JOIN, GROUP BY, window functions              | Simple API: Get/Put/Scan or limited query DSL
Consistency    | ACID transactions guaranteed                        | BASE; tunable per-operation (Cassandra)
Scaling        | Vertical + read replicas; sharding is hard          | Horizontal sharding built in (designed for it)
Use when       | Complex queries, JOINs, transactions, stable schema | Known simple access patterns, massive scale, flexible schema
Examples       | PostgreSQL, MySQL, Aurora, SQLite                   | DynamoDB, Cassandra, MongoDB, Redis, Neo4j
The 5 NoSQL Data Models
Key-Value
SIMPLEST · FASTEST
Hash map: key → opaque value (string, binary, JSON). O(1) get/put. No query on value internals.
Redis, DynamoDB, Riak
Sessions, caching, shopping carts
Document
FLEXIBLE SCHEMA
Nested JSON/BSON. Query any field. Schema per document. Partial updates. No joins across documents.
MongoDB, CouchDB, Firestore
Catalogs, user profiles, CMS
Wide-Column
WRITE-HEAVY · TIME-SERIES
Row key → sorted map of column families. Sparse columns. Efficient range scans on row key. Designed for high-write workloads.
Cassandra, HBase, Bigtable
IoT telemetry, analytics, messaging
Graph
RELATIONSHIPS FIRST
Nodes (entities) + edges (relationships) + properties. Traverse relationships in O(1) per hop (not O(n) JOIN).
Neo4j, Amazon Neptune, Dgraph
Social networks, fraud detection, recommendations
Time-Series
TIMESTAMP OPTIMISED
Timestamp + measurement pairs. Optimised for time-range queries, aggregations, and downsampling. Automatic compression.
InfluxDB, TimescaleDB, Prometheus
Metrics, monitoring, IoT, financial ticks
Replication
How to survive node failures and serve more reads
// PRIMARY-REPLICA (LEADER-FOLLOWER)

WRITE CLIENT ──→ PRIMARY (accepts writes)
                     ├──async──→ REPLICA 1 ←── READ CLIENT
                     └──async──→ REPLICA 2 ←── READ CLIENT
✓ Scales reads horizontally (add more replicas)
✗ Replication lag: reads from replica may be stale
✗ Primary is still a single write point
Replication Lag — Solutions
Strategies to handle stale reads
Problem: User updates profile → reads from replica → sees old data (lag ~100ms–2s)

Strategy 1: Read-your-own-writes
  Route user's reads to PRIMARY for their own data only.
  How: track last_write_time per user; if recent → route to primary.
  Cost: extra load on primary for the write author's reads.

Strategy 2: Monotonic Reads
  Always route same user to same replica.
  Prevents user seeing data "go backwards" (newer on one replica, older on next).
  How: Hash(userId) % numReplicas → sticky routing.

Strategy 3: Semi-Synchronous Replication
  Primary waits for ACK from at least 1 replica before confirming write.
  Zero data loss on primary crash (at least 1 replica has the write).
  Cost: write latency += 1 network RTT to replica.

Strategy 4: Route critical paths to primary
  Payment confirmation, inventory check → always read from primary.
  Profile photos, comment counts → can read from replica.
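Strategy 1 fits in a few lines (a sketch; the 2-second lag budget is an assumed upper bound on replication lag):

```python
import time

REPLICATION_LAG_BUDGET = 2.0  # seconds; assume replicas catch up in this window

last_write_at: dict[str, float] = {}  # user_id -> time of their last write

def record_write(user_id: str) -> None:
    last_write_at[user_id] = time.monotonic()

def pick_endpoint(reader_id: str, data_owner_id: str) -> str:
    """Read-your-own-writes: route to the primary only when the reader
    recently wrote their own data; everyone else can hit a replica."""
    if reader_id == data_owner_id:
        wrote = last_write_at.get(reader_id)
        if wrote is not None and time.monotonic() - wrote < REPLICATION_LAG_BUDGET:
            return "primary"
    return "replica"

record_write("u1")
print(pick_endpoint("u1", "u1"))  # primary — u1 just wrote their own data
print(pick_endpoint("u2", "u1"))  # replica — someone else's data
```

In practice the `last_write_at` map would live in a shared store (e.g. a session token or cache), not in process memory.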
Multi-Primary (Active-Active) Replication
ASPECT              | PRIMARY-REPLICA             | MULTI-PRIMARY
Write acceptance    | Only primary accepts writes | Any node accepts writes
Write scaling       | Limited to primary capacity | Writes scale across all nodes
Conflict risk       | None (single writer)        | High — concurrent writes to same row
Conflict resolution | N/A                         | LWW timestamp, CRDTs, app-level merge
Use case            | Read-heavy web apps         | Geo-distributed active-active, Cassandra
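A minimal sketch of LWW (last-write-wins) conflict resolution — values and timestamps are illustrative, and note that LWW silently discards the losing write:

```python
# Last-write-wins merge, the simplest multi-primary conflict resolution:
# each write carries a timestamp; on sync, the latest timestamp wins.
# Caveat: the "losing" concurrent write is silently discarded.

def lww_merge(a: tuple[str, float], b: tuple[str, float]) -> tuple[str, float]:
    return a if a[1] >= b[1] else b

# Concurrent updates to the same row on two primaries:
us_east = ("name=Alice",  1700000000.120)
eu_west = ("name=Alicia", 1700000000.450)  # slightly later wall clock

merged = lww_merge(us_east, eu_west)
print(merged[0])  # name=Alicia — both regions converge on this value
```

Because the merge is commutative, both regions converge regardless of the order they see the writes; CRDTs generalise this idea, and app-level merge hands the conflict to the application instead.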
Sharding (Horizontal Partitioning)
Distributing rows across multiple DB nodes — when one machine isn't enough
Range-Based
ORDERED · SIMPLE
Partition by value range of shard key.

Shard 0: user_id 1–25M
Shard 1: user_id 25M–50M
Shard 2: user_id 50M–75M
✓ Range queries on one or few shards
✓ Easy to add new shards at tail
✗ Hot spots if recent data is most active
✗ Uneven load if data is skewed
Hash-Based
EVEN DISTRIBUTION
shard = hash(key) % N

user_12345 → hash(12345)%4 → Shard 2
Uniformly distributes keys regardless of value patterns.
✓ Even distribution, no hot spots
✓ Simple deterministic routing
✗ Range queries → scatter-gather all shards
✗ Resharding remaps all keys (painful)
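The resharding pain is easy to quantify — growing from `key % 4` to `key % 5` remaps most keys (the sketch uses the integer key directly as its own hash):

```python
# How many keys move when a mod-N cluster grows from 4 to 5 shards?
keys = range(100_000)
moved = sum(1 for k in keys if k % 4 != k % 5)
print(f"{moved / len(keys):.0%} of keys remapped")  # 80%
```

A key stays put only when `k % 4 == k % 5`, which happens for just 4 of every 20 consecutive keys, so 80% of the data must migrate.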
Consistent Hashing
MINIMAL REHASH · RING
Map servers + keys to same hash ring (0→2³²). Each key → nearest server clockwise.

Add/remove server → only ~K/N keys remapped (vs nearly all keys with mod-N hashing).
✓ Adding/removing nodes is non-disruptive
✓ Virtual nodes for even distribution
✗ More complex to implement
✗ Still no range query support
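A compact ring sketch (MD5 is used only as a stable hash, not a recommendation; node names are illustrative):

```python
import bisect
import hashlib

def h(s: str) -> int:
    # Stable hash — Python's built-in hash() is salted per process
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes: list[str], vnodes: int = 100):
        self.points: list[tuple[int, str]] = []  # sorted (hash, node)
        for node in nodes:
            self.add(node, vnodes)

    def add(self, node: str, vnodes: int = 100) -> None:
        for i in range(vnodes):  # virtual nodes smooth the distribution
            bisect.insort(self.points, (h(f"{node}#vn{i}"), node))

    def lookup(self, key: str) -> str:
        i = bisect.bisect(self.points, (h(key), "")) % len(self.points)
        return self.points[i][1]  # first point clockwise from the key

ring = Ring(["db0", "db1", "db2"])
keys = [f"user_{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.add("db3")  # scale out by one node
after = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
# Defining property: every key that moved, moved TO the new node
assert all(after[k] == "db3" for k in moved)
print(f"{len(moved) / len(keys):.0%} of keys moved (≈ K/N = 1/4 expected)")
```

Contrast with mod-N hashing, where growing the cluster remaps most keys; here only the keys that land on the new node's arcs move at all.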
The Hotkey Problem: A single shard key (Taylor Swift's user_id during a concert drop) gets 100× more traffic than others. Solutions: (1) Key suffixing — distribute post_id_0...post_id_9 across shards; (2) Cache in Redis; (3) Per-shard read replicas. Design your shard key to co-locate related data and distribute write load.
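Key suffixing in miniature (a toy sharded like-counter; `NUM_SUFFIXES` and the key names are illustrative):

```python
import random

NUM_SUFFIXES = 10  # spread one hot key across 10 physical keys

def write_key(post_id: str) -> str:
    # Each write goes to one random sub-key → write load spreads
    # across shards instead of hammering a single key
    return f"{post_id}_{random.randrange(NUM_SUFFIXES)}"

def read_keys(post_id: str) -> list[str]:
    # Reads must fan out to ALL sub-keys and aggregate (e.g. sum counters)
    return [f"{post_id}_{i}" for i in range(NUM_SUFFIXES)]

# Example: like-counter for one viral post
counters: dict[str, int] = {}
for _ in range(1000):
    k = write_key("post_42")
    counters[k] = counters.get(k, 0) + 1

total_likes = sum(counters.get(k, 0) for k in read_keys("post_42"))
print(total_likes)  # 1000 — correct total, load spread over 10 keys
```

The trade is a 10× read fan-out for a 10× reduction in per-key write pressure, which is usually the right direction for write-hot counters.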
Cross-Shard Query Problem
Strategies for queries that span shards
Problem: SELECT COUNT(*) FROM orders GROUP BY restaurant_id
         orders are sharded by user_id → restaurant data is spread across all shards
         → Must query ALL shards and merge results in application layer

Solution 1: Denormalise (NoSQL pattern)
  Embed related data in the same document.
  User document includes their recent orders → no cross-shard JOIN needed.

Solution 2: Co-locate by access pattern
  Shard ALL of a user's data by user_id.
  Query "all my orders" stays on one shard.
  But "all orders for this restaurant" still requires scatter-gather.

Solution 3: Separate analytics store
  Write events to Kafka → consume into data warehouse (BigQuery, Redshift).
  Cross-tenant/cross-shard analytics run on the warehouse, not the OLTP DB.

Solution 4: Accept scatter-gather for rare queries
  Fan out to all shards, merge in application layer, cache the result aggressively.
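Solution 4 in miniature (toy in-memory shards; the per-shard GROUP BY followed by an application-layer merge is the point):

```python
from collections import Counter

# Toy shards: orders partitioned by user_id; each row = (user_id, restaurant_id)
shards = [
    [(1, "r_a"), (1, "r_b"), (5, "r_a")],   # shard 0
    [(2, "r_a"), (6, "r_c")],               # shard 1
    [(3, "r_b"), (7, "r_b"), (11, "r_a")],  # shard 2
]

def count_orders_by_restaurant() -> Counter:
    """Scatter-gather: run the partial GROUP BY on every shard,
    then merge the partial counts in the application layer."""
    total = Counter()
    for shard in shards:                        # scatter (parallel in practice)
        partial = Counter(r for _, r in shard)  # per-shard GROUP BY
        total.update(partial)                   # gather + merge
    return total

print(count_orders_by_restaurant())  # Counter({'r_a': 4, 'r_b': 3, 'r_c': 1})
```

COUNT, SUM, MIN, and MAX merge cleanly this way; AVG needs (sum, count) pairs, and percentiles need sketches — which is one more reason to push heavy analytics to a warehouse.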
Database Selection Guide
The right database for the right job — based on access patterns, not familiarity
Need ACID + complex JOINs?
PostgreSQL / MySQL / Aurora
Need massive write throughput (millions/sec)?
Cassandra / DynamoDB
Need flexible JSON schema per document?
MongoDB / CouchDB / Firestore
Need graph traversal (friends-of-friends)?
Neo4j / Amazon Neptune
Need time-series metrics + monitoring?
InfluxDB / TimescaleDB / Prometheus
Need full-text search + facets?
Elasticsearch / OpenSearch / Algolia
Need distributed SQL (geo-global)?
CockroachDB / Google Spanner
Need in-memory cache / pub-sub / leaderboard?
Redis / Memcached
Comparison Matrix
DATABASE       | TYPE          | CONSISTENCY          | SCALE                    | BEST FOR
PostgreSQL     | Relational    | ACID                 | Vertical + read replicas | Complex queries, transactions, analytics
MySQL / Aurora | Relational    | ACID                 | Vertical + read replicas | Web apps, proven reliability, Aurora serverless
MongoDB        | Document      | Tunable              | Horizontal sharding      | Flexible schema, catalogs, user profiles
Cassandra      | Wide-column   | Eventual / Tunable   | Horizontal (massive)     | High-write IoT, time-series, messaging
DynamoDB       | KV / Document | Eventual / Strong    | Horizontal (managed)     | Serverless, variable load, simple access patterns
Redis          | Key-Value     | Strong (single node) | Cluster mode             | Cache, sessions, pub/sub, sorted sets
Elasticsearch  | Search        | Eventual             | Horizontal               | Full-text search, log analysis, facets
InfluxDB       | Time-series   | Eventual             | Horizontal               | Metrics, monitoring, IoT telemetry
Interview Tips
QUESTION | STRONG ANSWER
"SQL or NoSQL?" | Depends on access patterns. Complex JOINs + ACID → SQL. Known simple key lookups + horizontal scale → NoSQL. Tell me the access patterns first.
"How does B-tree indexing work?" | Balanced tree keeps keys sorted. Binary search on keys = O(log n). Range queries traverse adjacent leaves. Cost: every write updates all indexes.
"How to handle a hotkey?" | Cache in Redis (avoid DB entirely), key suffixing (distribute load across N keys), or per-shard read replicas for the hot shard.
"Primary vs replica reads?" | Replica reads are faster but may be stale by replication lag. Use primary for user's own writes (read-your-writes) and financial data. Replica is fine for other users' public data.
"ACID vs BASE?" | ACID: all-or-nothing transactions, strong consistency — right for banking, booking. BASE: available, eventually consistent — right for social likes, analytics.
01
Index Design — Orders Table
~1.5 hrs
CREATE TABLE orders (
    id            BIGINT PRIMARY KEY,
    user_id       INT,
    restaurant_id INT,
    status        VARCHAR(20),  -- 'pending', 'delivered', 'cancelled'
    total         DECIMAL(10,2),
    created_at    TIMESTAMP
);

Query patterns:
  Q1: All orders for a specific user (most frequent)
  Q2: Recent orders for a restaurant ordered by time
  Q3: All pending orders (dashboard refresh every 5s)
  Q4: Orders by user in a date range

For each query:
  1. Design the optimal index (name columns + order)
  2. Explain the B-tree traversal path
  3. Identify if a covering index is possible
  4. Explain the INSERT/UPDATE cost of your index choice
  Bonus: Q3 — would a partial index help? Why?
02
Sharding Design — Multi-Tenant SaaS
~1.5 hrs

10,000 business tenants, each with 10K–1M records. Most queries scoped to single tenant. 3 large tenants = 60% of traffic. Occasional cross-tenant analytics.

  1. What is the shard key? Why?
  2. Which sharding strategy: range, hash, or consistent hashing? Why?
  3. How do you handle the 3 hot tenants without letting them overload one shard?
  4. How do you serve cross-tenant analytics without scatter-gathering all shards?
  5. What happens when a tenant grows 10× — does your design need changes?
03
Replication Lag Decision Scenarios
~1 hr

Social media platform: primary DB for writes, 3 read replicas, replication lag 100ms–2s.

For each, decide: primary or replica? Justify with the consistency model needed.

  1. User views their own profile immediately after updating it
  2. User views another user's follower count
  3. User's home feed (posts from people they follow)
  4. Payment confirmation page after completing a purchase
  5. Admin dashboard showing total active users (refreshed every 5 min)
04
Design Instagram's Storage Layer
~3 hrs · full HLD

Design complete storage for Instagram. For each data type: DB choice, CAP position, sharding key, storage estimate.

  • User accounts (500M users)
  • Photos/Videos (100B media files, avg 3 MB)
  • Comments (10B comments)
  • Likes (500B likes — read-heavy, approximate count OK)
  • Follow relationships (avg 500 followers/user)
  • Feed (posts from people you follow — fan-out on write vs read)

Required: draw the storage architecture diagram, estimate total storage, identify the hardest consistency challenge.

MODULE B2 · DATABASES AT SCALE
B-tree indexing: O(log n), left-most prefix rule, range query support
Composite and covering indexes — when and how to design them
ACID — all 4 properties in depth, including isolation levels
BASE — deliberate trade-off, when it's the right choice
SQL vs NoSQL decision: based on access patterns, not preference
All 5 NoSQL data models: KV, Document, Wide-column, Graph, Time-series
Primary-replica replication + replication lag solutions (4 strategies)
Multi-primary replication — conflict resolution strategies
Range vs hash vs consistent hashing sharding — trade-offs of each
Hotkey problem — 3 solutions: cache, key suffixing, read replicas per shard
Cross-shard query problem and solutions (denorm, co-locate, analytics store)
Can pick the right DB for any scenario in the selection guide
✏️ Tasks 1–3 completed (index design, sharding, replication lag)
✏️ Capstone: Instagram storage layer — full design with estimates
// NEXT MODULE
B3 — Caching
Redis data structures · CDN · Cache-aside · Write-through · Write-back
Cache invalidation strategies · TTL vs eviction · Cache stampede
Read-through cache · Consistent hashing for distributed cache