Track B · HLD · Module B2 · Week 12

Databases
at Scale

Indexing · ACID vs BASE · SQL vs NoSQL
Replication · Sharding · Read Replicas · Selection Guide
7 TOPICS · 4 TASKS · 5 NOSQL MODELS · MODULE B2
B-Tree Index
ACID / BASE
SQL vs NoSQL
Replication
Sharding
Hotkey
DB Selection
Indexing
Why queries go from O(n) to O(log n) — and what it costs
B-Tree Index
Balanced search tree of (key → row pointer) pairs. Keys sorted → enables range queries, ORDER BY, binary search in O(log n).
USE: equality (=), range (<, >, BETWEEN), ORDER BY, JOIN ON
DEFAULT: PostgreSQL, MySQL, Oracle
Hash Index
Hash map of key → row pointer. O(1) exact lookup. Keys unordered — range queries impossible.
USE: equality only (=). Never for ranges.
EXAMPLES: Memcached, Redis, MySQL MEMORY engine
Composite Index
Index on (col_A, col_B). Left-most prefix rule: queries must start with col_A to use the index. Most selective column goes first.
INDEX(user_id, created_at)
✓ WHERE user_id=5
✓ WHERE user_id=5 AND created_at>X
✗ WHERE created_at>X (skips prefix)
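The left-most prefix rule is easy to verify against SQLite's query planner (a minimal sketch; table and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY, user_id INT, created_at INT, total REAL)""")
conn.execute("CREATE INDEX idx_user_time ON orders(user_id, created_at)")

def plan(query: str) -> str:
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return " ".join(row[3] for row in
                    conn.execute("EXPLAIN QUERY PLAN " + query))

p1 = plan("SELECT * FROM orders WHERE user_id = 5")
p2 = plan("SELECT * FROM orders WHERE user_id = 5 AND created_at > 100")
p3 = plan("SELECT * FROM orders WHERE created_at > 100")

print(p1)  # SEARCH ... USING INDEX idx_user_time (user_id=?)
print(p2)  # SEARCH ... USING INDEX idx_user_time (user_id=? AND created_at>?)
print(p3)  # SCAN ... — prefix column skipped, index unusable
```

The first two queries start with `user_id`, so the planner searches the B-tree; the third skips the prefix and falls back to a full scan.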
Covering Index — query satisfied entirely from the index
-- Table: orders (10M rows)
-- Query: SELECT user_id, created_at FROM orders WHERE user_id = 5

-- Regular index on user_id:
--   1. B-tree lookup → row pointer
--   2. Fetch row from heap (random disk I/O) to read created_at

-- Covering index on (user_id, created_at):
CREATE INDEX idx_covering ON orders(user_id, created_at);
--   1. B-tree lookup → both columns found IN the index leaf
--   2. NO heap access at all → drastically reduced I/O
--   ✅ PostgreSQL calls this "Index Only Scan"

-- Index design for common queries:
CREATE INDEX idx_user   ON orders(user_id);                  -- Q1: orders by user
CREATE INDEX idx_rest_t ON orders(restaurant_id, created_at); -- Q2: restaurant + time range
CREATE INDEX idx_status ON orders(status) WHERE status = 'pending'; -- Q3: partial index!
CREATE INDEX idx_user_d ON orders(user_id, created_at);      -- Q4: user + date range
Index cost trade-off: Every write (INSERT/UPDATE/DELETE) must update all indexes on that table. A table with 10 indexes needs 10 B-tree updates per write. Index only columns used in WHERE, JOIN ON, ORDER BY. Low-cardinality columns (boolean, status with 3 values) often give poor selectivity — the optimizer may prefer a full scan.
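For a low-cardinality column like `status`, a partial index sidesteps the selectivity problem: only the rows you actually query enter the B-tree. A sketch in SQLite (names illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL)")
# Partial index: only 'pending' rows enter the B-tree — a tiny index
# that is cheap to maintain, even though status has few distinct values
conn.execute(
    "CREATE INDEX idx_pending ON orders(status) WHERE status = 'pending'")

rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE status = 'pending'")
detail = " ".join(r[3] for r in rows)
print(detail)  # the planner picks idx_pending for the dashboard query
```

The planner can use the partial index because the query's WHERE clause implies the index's WHERE clause.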
ACID vs BASE
The consistency guarantee spectrum — choose based on use case
ACID
RELATIONAL DATABASES · STRONG GUARANTEES
A
Atomicity — Transaction is all-or-nothing. Transfer $100: if debit succeeds but credit fails, both are rolled back. No partial writes survive.
C
Consistency — DB transitions from one valid state to another. Constraints (FK, UNIQUE, NOT NULL, CHECK) always satisfied post-transaction.
I
Isolation — Concurrent transactions don't see each other's in-progress writes. Levels: READ COMMITTED (default PG) → REPEATABLE READ → SERIALIZABLE.
D
Durability — Committed transactions survive crashes. Guaranteed via WAL (write-ahead log) + fsync to disk before confirming commit.
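A minimal atomicity demo with Python's sqlite3 (account names and the CHECK constraint are illustrative): the credit succeeds, the debit violates the constraint, and the whole transaction rolls back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts "
             "(name TEXT PRIMARY KEY, balance INT CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 50), ("bob", 0)])
conn.commit()

try:
    with conn:  # one transaction: commit on success, rollback on error
        conn.execute(  # credit bob — this statement succeeds
            "UPDATE accounts SET balance = balance + 100 WHERE name = 'bob'")
        conn.execute(  # debit alice — 50 - 100 violates CHECK → error
            "UPDATE accounts SET balance = balance - 100 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    pass  # transaction rolled back; bob's credit is undone too

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 50, 'bob': 0} — no partial write survived
```

Even though bob's credit executed without error, it does not survive: the transaction is all-or-nothing.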
BASE
NOSQL / DISTRIBUTED · AVAILABLE BY DESIGN
B
Basically Available — System always responds, even if response is partial or stale. Rejects requests only in extreme failure, not during replication lag.
S
Soft State — System state can change over time even without new inputs. Replicas catching up, caches expiring, tombstones propagating.
E
Eventually Consistent — Given no new updates, all replicas converge to the same state. No guarantee on when. Reads may return stale data.
BASE is not "broken ACID" — it's a deliberate design choice.
Cassandra, DynamoDB, CouchDB are BASE by design.
Trade: strong consistency → availability + throughput
Isolation Levels Deep-Dive
ISOLATION LEVEL  | DIRTY READ | NON-REPEATABLE | PHANTOM READ | DEFAULT IN
READ UNCOMMITTED | Allowed    | Allowed        | Allowed      | —
READ COMMITTED   | Prevented  | Allowed        | Allowed      | PostgreSQL
REPEATABLE READ  | Prevented  | Prevented      | Allowed      | MySQL InnoDB
SERIALIZABLE     | Prevented  | Prevented      | Prevented    | Highest isolation
Interview pattern: Higher isolation level = fewer anomalies but more locking = lower throughput. Most applications run at READ COMMITTED. SERIALIZABLE is used for financial systems where phantom reads would cause incorrect calculations (e.g., calculating remaining inventory before inserting an order).
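The anomalies can be made concrete with a toy model (plain Python, not a real database — a per-transaction write buffer plus a committed store):

```python
# Toy model, NOT a real DB: one committed store plus a per-transaction
# write buffer, just to make each anomaly concrete.
committed = {"x": 10}
txn_a_writes = {"x": 99}  # txn A wrote x=99 but has NOT committed

# Dirty read (only READ UNCOMMITTED allows it): txn B sees A's
# uncommitted buffer — a value that may never actually commit.
dirty_read = txn_a_writes.get("x", committed["x"])   # 99

# READ COMMITTED: txn B only ever reads committed state.
committed_read = committed["x"]                      # 10

# Non-repeatable read (still allowed at READ COMMITTED): another
# txn commits between two reads inside the same transaction.
first_read = committed["x"]                          # 10
committed["x"] = 20                                  # other txn commits x=20
second_read = committed["x"]                         # 20 — changed mid-txn!

# REPEATABLE READ fixes this by reading a snapshot taken at txn start.
snapshot = {"x": first_read}
repeatable = snapshot["x"]                           # still 10

print(dirty_read, committed_read, first_read, second_read, repeatable)
```

Real engines implement the snapshot with MVCC rather than a copied dict, but the visibility rules are the same.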
SQL vs NoSQL
Not a religion — a trade-off based on access patterns and consistency needs
DIMENSION      | SQL (Relational)                                    | NoSQL (Distributed)
Data model     | Tables, rows, strict schema, FK constraints         | Documents, KV, wide-column, graph — flexible
Query language | SQL — JOIN, GROUP BY, window functions              | Simple API: Get/Put/Scan or limited query DSL
Consistency    | ACID transactions guaranteed                        | BASE; tunable per-operation (Cassandra)
Scaling        | Vertical + read replicas; sharding is hard          | Horizontal sharding built in (designed for it)
Use when       | Complex queries, JOINs, transactions, stable schema | Known simple access patterns, massive scale, flexible schema
Examples       | PostgreSQL, MySQL, Aurora, SQLite                   | DynamoDB, Cassandra, MongoDB, Redis, Neo4j
The 5 NoSQL Data Models
Key-Value
SIMPLEST · FASTEST
Hash map: key → opaque value (string, binary, JSON). O(1) get/put. No query on value internals.
Redis, DynamoDB, Riak
Sessions, caching, shopping carts
Document
FLEXIBLE SCHEMA
Nested JSON/BSON. Query any field. Schema per document. Partial updates. No joins across documents.
MongoDB, CouchDB, Firestore
Catalogs, user profiles, CMS
Wide-Column
WRITE-HEAVY · TIME-SERIES
Row key → sorted map of column families. Sparse columns. Efficient range scans on row key. Designed for high-write workloads.
Cassandra, HBase, Bigtable
IoT telemetry, analytics, messaging
Graph
RELATIONSHIPS FIRST
Nodes (entities) + edges (relationships) + properties. Traverse relationships in O(1) per hop (not O(n) JOIN).
Neo4j, Amazon Neptune, Dgraph
Social networks, fraud detection, recommendations
Time-Series
TIMESTAMP OPTIMISED
Timestamp + measurement pairs. Optimised for time-range queries, aggregations, and downsampling. Automatic compression.
InfluxDB, TimescaleDB, Prometheus
Metrics, monitoring, IoT, financial ticks
Replication
How to survive node failures and serve more reads
// PRIMARY-REPLICA (LEADER-FOLLOWER)

WRITE CLIENT ──→ PRIMARY (accepts writes)
                     ├──async──→ REPLICA 1 ←── READ CLIENT
                     └──async──→ REPLICA 2 ←── READ CLIENT
✓ Scales reads horizontally (add more replicas)
✗ Replication lag: reads from replica may be stale
✗ Primary is still a single write point
Replication Lag — Solutions
Strategies to handle stale reads
Problem: User updates profile → reads from replica → sees old data (lag ~100ms–2s)

Strategy 1: Read-your-own-writes
  Route user's reads to PRIMARY for their own data only.
  How: track last_write_time per user; if recent → route to primary.
  Cost: extra load on primary for the write author's reads.

Strategy 2: Monotonic Reads
  Always route same user to same replica.
  Prevents user seeing data "go backwards" (newer on one replica, older on next).
  How: Hash(userId) % numReplicas → sticky routing.

Strategy 3: Semi-Synchronous Replication
  Primary waits for ACK from at least 1 replica before confirming write.
  Zero data loss on primary crash (at least 1 replica has the write).
  Cost: write latency += 1 network RTT to replica.

Strategy 4: Route critical paths to primary
  Payment confirmation, inventory check → always read from primary.
  Profile photos, comment counts → can read from replica.
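Strategy 1 fits in a few lines (a sketch; the 2-second lag budget is an assumed upper bound on replication lag):

```python
import time

REPLICATION_LAG_BUDGET = 2.0  # seconds; assume replicas catch up in this window

last_write_at: dict[str, float] = {}  # user_id -> time of their last write

def record_write(user_id: str) -> None:
    last_write_at[user_id] = time.monotonic()

def pick_endpoint(reader_id: str, data_owner_id: str) -> str:
    """Read-your-own-writes: route to the primary only when the reader
    recently wrote their own data; everyone else can hit a replica."""
    if reader_id == data_owner_id:
        wrote = last_write_at.get(reader_id)
        if wrote is not None and time.monotonic() - wrote < REPLICATION_LAG_BUDGET:
            return "primary"
    return "replica"

record_write("u1")
print(pick_endpoint("u1", "u1"))  # primary — u1 just wrote their own data
print(pick_endpoint("u2", "u1"))  # replica — someone else's data
```

In practice the `last_write_at` map would live in a shared store (e.g. a session token or cache), not in process memory.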
Multi-Primary (Active-Active) Replication
ASPECT              | PRIMARY-REPLICA             | MULTI-PRIMARY
Write acceptance    | Only primary accepts writes | Any node accepts writes
Write scaling       | Limited to primary capacity | Writes scale across all nodes
Conflict risk       | None (single writer)        | High — concurrent writes to same row
Conflict resolution | N/A                         | LWW timestamp, CRDTs, app-level merge
Use case            | Read-heavy web apps         | Geo-distributed active-active, Cassandra
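A minimal sketch of LWW (last-write-wins) conflict resolution — values and timestamps are illustrative, and note that LWW silently discards the losing write:

```python
# Last-write-wins merge, the simplest multi-primary conflict resolution:
# each write carries a timestamp; on sync, the latest timestamp wins.
# Caveat: the "losing" concurrent write is silently discarded.

def lww_merge(a: tuple[str, float], b: tuple[str, float]) -> tuple[str, float]:
    return a if a[1] >= b[1] else b

# Concurrent updates to the same row on two primaries:
us_east = ("name=Alice",  1700000000.120)
eu_west = ("name=Alicia", 1700000000.450)  # slightly later wall clock

merged = lww_merge(us_east, eu_west)
print(merged[0])  # name=Alicia — both regions converge on this value
```

Because the merge is commutative, both regions converge regardless of the order they see the writes; CRDTs generalise this idea, and app-level merge hands the conflict to the application instead.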
Sharding (Horizontal Partitioning)
Distributing rows across multiple DB nodes — when one machine isn't enough
Range-Based
ORDERED · SIMPLE
Partition by value range of shard key.

Shard 0: user_id 1–25M
Shard 1: user_id 25M–50M
Shard 2: user_id 50M–75M
✓ Range queries on one or few shards
✓ Easy to add new shards at tail
✗ Hot spots if recent data is most active
✗ Uneven load if data is skewed
Hash-Based
EVEN DISTRIBUTION
shard = hash(key) % N

user_12345 → hash(12345)%4 → Shard 2
Uniformly distributes keys regardless of value patterns.
✓ Even distribution, no hot spots
✓ Simple deterministic routing
✗ Range queries → scatter-gather all shards
✗ Resharding remaps all keys (painful)
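The resharding pain is easy to quantify — growing from `key % 4` to `key % 5` remaps most keys (the sketch uses the integer key directly as its own hash):

```python
# How many keys move when a mod-N cluster grows from 4 to 5 shards?
keys = range(100_000)
moved = sum(1 for k in keys if k % 4 != k % 5)
print(f"{moved / len(keys):.0%} of keys remapped")  # 80%
```

A key stays put only when `k % 4 == k % 5`, which happens for just 4 of every 20 consecutive keys, so 80% of the data must migrate.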
Consistent Hashing
MINIMAL REHASH · RING
Map servers + keys to same hash ring (0→2³²). Each key → nearest server clockwise.

Add/remove server → only ~K/N keys remapped (vs nearly all keys with mod-N hashing).
✓ Adding/removing nodes is non-disruptive
✓ Virtual nodes for even distribution
✗ More complex to implement
✗ Still no range query support
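A compact ring sketch (MD5 is used only as a stable hash, not a recommendation; node names are illustrative):

```python
import bisect
import hashlib

def h(s: str) -> int:
    # Stable hash — Python's built-in hash() is salted per process
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes: list[str], vnodes: int = 100):
        self.points: list[tuple[int, str]] = []  # sorted (hash, node)
        for node in nodes:
            self.add(node, vnodes)

    def add(self, node: str, vnodes: int = 100) -> None:
        for i in range(vnodes):  # virtual nodes smooth the distribution
            bisect.insort(self.points, (h(f"{node}#vn{i}"), node))

    def lookup(self, key: str) -> str:
        i = bisect.bisect(self.points, (h(key), "")) % len(self.points)
        return self.points[i][1]  # first point clockwise from the key

ring = Ring(["db0", "db1", "db2"])
keys = [f"user_{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.add("db3")  # scale out by one node
after = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
# Defining property: every key that moved, moved TO the new node
assert all(after[k] == "db3" for k in moved)
print(f"{len(moved) / len(keys):.0%} of keys moved (≈ K/N = 1/4 expected)")
```

Contrast with mod-N hashing, where growing the cluster remaps most keys; here only the keys that land on the new node's arcs move at all.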
The Hotkey Problem: A single shard key (Taylor Swift's user_id during a concert drop) gets 100× more traffic than others. Solutions: (1) Key suffixing — distribute post_id_0...post_id_9 across shards; (2) Cache in Redis; (3) Per-shard read replicas. Design your shard key to co-locate related data and distribute write load.
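Key suffixing in miniature (a toy sharded like-counter; `NUM_SUFFIXES` and the key names are illustrative):

```python
import random

NUM_SUFFIXES = 10  # spread one hot key across 10 physical keys

def write_key(post_id: str) -> str:
    # Each write goes to one random sub-key → write load spreads
    # across shards instead of hammering a single key
    return f"{post_id}_{random.randrange(NUM_SUFFIXES)}"

def read_keys(post_id: str) -> list[str]:
    # Reads must fan out to ALL sub-keys and aggregate (e.g. sum counters)
    return [f"{post_id}_{i}" for i in range(NUM_SUFFIXES)]

# Example: like-counter for one viral post
counters: dict[str, int] = {}
for _ in range(1000):
    k = write_key("post_42")
    counters[k] = counters.get(k, 0) + 1

total_likes = sum(counters.get(k, 0) for k in read_keys("post_42"))
print(total_likes)  # 1000 — correct total, load spread over 10 keys
```

The trade is a 10× read fan-out for a 10× reduction in per-key write pressure, which is usually the right direction for write-hot counters.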
Cross-Shard Query Problem
Strategies for queries that span shards
Problem: SELECT COUNT(*) FROM orders GROUP BY restaurant_id
         orders are sharded by user_id → restaurant data is spread across all shards
         → Must query ALL shards and merge results in application layer

Solution 1: Denormalise (NoSQL pattern)
  Embed related data in the same document.
  User document includes their recent orders → no cross-shard JOIN needed.

Solution 2: Co-locate by access pattern
  Shard ALL of a user's data by user_id.
  Query "all my orders" stays on one shard.
  But "all orders for this restaurant" still requires scatter-gather.

Solution 3: Separate analytics store
  Write events to Kafka → consume into data warehouse (BigQuery, Redshift).
  Cross-tenant/cross-shard analytics run on the warehouse, not the OLTP DB.

Solution 4: Accept scatter-gather for rare queries
  Fan out to all shards, merge in application layer, cache the result aggressively.
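Solution 4 in miniature (toy in-memory shards; the per-shard GROUP BY followed by an application-layer merge is the point):

```python
from collections import Counter

# Toy shards: orders partitioned by user_id; each row = (user_id, restaurant_id)
shards = [
    [(1, "r_a"), (1, "r_b"), (5, "r_a")],   # shard 0
    [(2, "r_a"), (6, "r_c")],               # shard 1
    [(3, "r_b"), (7, "r_b"), (11, "r_a")],  # shard 2
]

def count_orders_by_restaurant() -> Counter:
    """Scatter-gather: run the partial GROUP BY on every shard,
    then merge the partial counts in the application layer."""
    total = Counter()
    for shard in shards:                        # scatter (parallel in practice)
        partial = Counter(r for _, r in shard)  # per-shard GROUP BY
        total.update(partial)                   # gather + merge
    return total

print(count_orders_by_restaurant())  # Counter({'r_a': 4, 'r_b': 3, 'r_c': 1})
```

COUNT, SUM, MIN, and MAX merge cleanly this way; AVG needs (sum, count) pairs, and percentiles need sketches — which is one more reason to push heavy analytics to a warehouse.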
Database Selection Guide
The right database for the right job — based on access patterns, not familiarity
Need ACID + complex JOINs?
PostgreSQL / MySQL / Aurora
Need massive write throughput (millions/sec)?
Cassandra / DynamoDB
Need flexible JSON schema per document?
MongoDB / CouchDB / Firestore
Need graph traversal (friends-of-friends)?
Neo4j / Amazon Neptune
Need time-series metrics + monitoring?
InfluxDB / TimescaleDB / Prometheus
Need full-text search + facets?
Elasticsearch / OpenSearch / Algolia
Need distributed SQL (geo-global)?
CockroachDB / Google Spanner
Need in-memory cache / pub-sub / leaderboard?
Redis / Memcached
Comparison Matrix
DATABASE       | TYPE          | CONSISTENCY          | SCALE                    | BEST FOR
PostgreSQL     | Relational    | ACID                 | Vertical + read replicas | Complex queries, transactions, analytics
MySQL / Aurora | Relational    | ACID                 | Vertical + read replicas | Web apps, proven reliability, Aurora serverless
MongoDB        | Document      | Tunable              | Horizontal sharding      | Flexible schema, catalogs, user profiles
Cassandra      | Wide-column   | Eventual / Tunable   | Horizontal (massive)     | High-write IoT, time-series, messaging
DynamoDB       | KV / Document | Eventual / Strong    | Horizontal (managed)     | Serverless, variable load, simple access patterns
Redis          | Key-Value     | Strong (single node) | Cluster mode             | Cache, sessions, pub/sub, sorted sets
Elasticsearch  | Search        | Eventual             | Horizontal               | Full-text search, log analysis, facets
InfluxDB       | Time-series   | Eventual             | Horizontal               | Metrics, monitoring, IoT telemetry
Interview Tips
QUESTION | STRONG ANSWER
"SQL or NoSQL?" | Depends on access patterns. Complex JOINs + ACID → SQL. Known simple key lookups + horizontal scale → NoSQL. Tell me the access patterns first.
"How does B-tree indexing work?" | Balanced tree keeps keys sorted. Binary search on keys = O(log n). Range queries traverse adjacent leaves. Cost: every write updates all indexes.
"How to handle a hotkey?" | Cache in Redis (avoid DB entirely), key suffixing (distribute load across N keys), or per-shard read replicas for the hot shard.
"Primary vs replica reads?" | Replica reads are faster but may be stale by replication lag. Use primary for user's own writes (read-your-writes) and financial data. Replica is fine for other users' public data.
"ACID vs BASE?" | ACID: all-or-nothing transactions, strong consistency — right for banking, booking. BASE: available, eventually consistent — right for social likes, analytics.
01
Index Design — Orders Table
~1.5 hrs
CREATE TABLE orders (
    id            BIGINT PRIMARY KEY,
    user_id       INT,
    restaurant_id INT,
    status        VARCHAR(20),  -- 'pending', 'delivered', 'cancelled'
    total         DECIMAL(10,2),
    created_at    TIMESTAMP
);

Query patterns:
  Q1: All orders for a specific user (most frequent)
  Q2: Recent orders for a restaurant ordered by time
  Q3: All pending orders (dashboard refresh every 5s)
  Q4: Orders by user in a date range

For each query:
  1. Design the optimal index (name columns + order)
  2. Explain the B-tree traversal path
  3. Identify if a covering index is possible
  4. Explain the INSERT/UPDATE cost of your index choice
  Bonus: Q3 — would a partial index help? Why?
02
Sharding Design — Multi-Tenant SaaS
~1.5 hrs

10,000 business tenants, each with 10K–1M records. Most queries scoped to single tenant. 3 large tenants = 60% of traffic. Occasional cross-tenant analytics.

  1. What is the shard key? Why?
  2. Which sharding strategy: range, hash, or consistent hashing? Why?
  3. How do you handle the 3 hot tenants without letting them overload one shard?
  4. How do you serve cross-tenant analytics without scatter-gathering all shards?
  5. What happens when a tenant grows 10× — does your design need changes?
03
Replication Lag Decision Scenarios
~1 hr

Social media platform: primary DB for writes, 3 read replicas, replication lag 100ms–2s.

For each, decide: primary or replica? Justify with the consistency model needed.

  1. User views their own profile immediately after updating it
  2. User views another user's follower count
  3. User's home feed (posts from people they follow)
  4. Payment confirmation page after completing a purchase
  5. Admin dashboard showing total active users (refreshed every 5 min)
04
Design Instagram's Storage Layer
~3 hrs · full HLD

Design complete storage for Instagram. For each data type: DB choice, CAP position, sharding key, storage estimate.

  • User accounts (500M users)
  • Photos/Videos (100B media files, avg 3 MB)
  • Comments (10B comments)
  • Likes (500B likes — read-heavy, approximate count OK)
  • Follow relationships (avg 500 followers/user)
  • Feed (posts from people you follow — fan-out on write vs read)

Required: draw the storage architecture diagram, estimate total storage, identify the hardest consistency challenge.

MODULE B2 · DATABASES AT SCALE
B-tree indexing: O(log n), left-most prefix rule, range query support
Composite and covering indexes — when and how to design them
ACID — all 4 properties in depth, including isolation levels
BASE — deliberate trade-off, when it's the right choice
SQL vs NoSQL decision: based on access patterns, not preference
All 5 NoSQL data models: KV, Document, Wide-column, Graph, Time-series
Primary-replica replication + replication lag solutions (4 strategies)
Multi-primary replication — conflict resolution strategies
Range vs hash vs consistent hashing sharding — trade-offs of each
Hotkey problem — 3 solutions: cache, key suffixing, read replicas per shard
Cross-shard query problem and solutions (denorm, co-locate, analytics store)
Can pick the right DB for any scenario in the selection guide
✏️ Tasks 1–3 completed (index design, sharding, replication lag)
✏️ Capstone: Instagram storage layer — full design with estimates
// NEXT MODULE
B3 — Caching
Redis data structures · CDN · Cache-aside · Write-through · Write-back
Cache invalidation strategies · TTL vs eviction · Cache stampede
Read-through cache · Consistent hashing for distributed cache