Indexing
Why queries go from O(n) to O(log n) — and what it costs
B-Tree Index
Balanced search tree of (key → row pointer) pairs. Keys sorted → enables range queries, ORDER BY, binary search in O(log n).
USE: equality (=), range (<, >, BETWEEN), ORDER BY, JOIN ON
DEFAULT: PostgreSQL, MySQL, Oracle
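A minimal sketch of both access patterns, assuming PostgreSQL and the orders table used in the covering-index example below:

-- B-tree index on a timestamp column (the default index type in PostgreSQL/MySQL)
CREATE INDEX idx_orders_created ON orders (created_at);

-- Equality: descend the tree to a single leaf, O(log n)
SELECT * FROM orders WHERE created_at = '2024-01-15 10:00:00';

-- Range + ORDER BY: descend once, then walk adjacent leaves in sorted order
SELECT * FROM orders
WHERE created_at BETWEEN '2024-01-01' AND '2024-01-31'
ORDER BY created_at;  -- no extra sort step: the index already stores keys sorted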
Hash Index
Hash map of key → row pointer. O(1) exact lookup. Keys unordered — range queries impossible.
USE: equality only (=). Never for ranges.
EXAMPLES: Memcached, Redis, MySQL MEMORY engine
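In PostgreSQL a hash index has to be requested explicitly; a minimal sketch, assuming a hypothetical users table with an email column:

CREATE TABLE users (id BIGINT PRIMARY KEY, email TEXT);

-- Hash index: O(1) exact-match lookups, nothing else
CREATE INDEX idx_users_email_hash ON users USING hash (email);

-- Can use the hash index (equality):
SELECT id FROM users WHERE email = 'a@example.com';

-- Cannot use it: hashed keys have no order, so a range predicate
-- falls back to a sequential scan or another index
SELECT id FROM users WHERE email > 'a';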
Composite Index
Index on (col_A, col_B). Left-most prefix rule: queries must start with col_A to use the index. Most selective column goes first.
INDEX(user_id, created_at)
✓ WHERE user_id=5
✓ WHERE user_id=5 AND created_at>X
✗ WHERE created_at>X (skips prefix)
Covering Index — query satisfied entirely from the index
-- Table: orders (10M rows)
-- Query: SELECT user_id, created_at FROM orders WHERE user_id = 5

-- Regular index on user_id:
--   1. B-tree lookup → row pointer
--   2. Fetch row from heap (random disk I/O) to read created_at

-- Covering index on (user_id, created_at):
CREATE INDEX idx_covering ON orders(user_id, created_at);
--   1. B-tree lookup → both columns found IN the index leaf
--   2. NO heap access at all → drastically reduced I/O
-- ✅ PostgreSQL calls this "Index Only Scan"

-- Index design for common queries:
CREATE INDEX idx_user   ON orders(user_id);                          -- Q1: orders by user
CREATE INDEX idx_rest_t ON orders(restaurant_id, created_at);        -- Q2: restaurant + time range
CREATE INDEX idx_status ON orders(status) WHERE status = 'pending';  -- Q3: partial index!
CREATE INDEX idx_user_d ON orders(user_id, created_at);              -- Q4: user + date range
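To confirm the optimizer actually avoids the heap, check the plan; a sketch assuming PostgreSQL (exact output varies with version and table statistics):

EXPLAIN
SELECT user_id, created_at FROM orders WHERE user_id = 5;
-- Expect a plan node like:
--   Index Only Scan using idx_covering on orders
-- Seeing "Index Scan" plus heap fetches instead means the index does not cover
-- the query, or the visibility map is not yet up to date (run VACUUM).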
Index cost trade-off: Every write (INSERT/UPDATE/DELETE) must update all indexes on that table. A table with 10 indexes needs 10 B-tree updates per write. Index only columns used in WHERE, JOIN ON, ORDER BY. Low-cardinality columns (boolean, status with 3 values) often give poor selectivity — the optimizer may prefer a full scan.
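One way to find indexes that pay this write cost without earning reads, assuming PostgreSQL's pg_stat_user_indexes statistics view:

-- Indexes never used for a scan since statistics were last reset:
SELECT schemaname, relname AS table_name, indexrelname AS index_name, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname;
-- Candidates for dropping, after confirming they are not needed for rare but
-- critical queries (month-end reports, backfills).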
ACID vs BASE
The consistency guarantee spectrum — choose based on use case
ACID
RELATIONAL DATABASES · STRONG GUARANTEES
A
Atomicity — Transaction is all-or-nothing. Transfer $100: if the debit succeeds but the credit fails, both are rolled back. No partial writes survive (see the transaction sketch after this block).
C
Consistency — DB transitions from one valid state to another. Constraints (FK, UNIQUE, NOT NULL, CHECK) always satisfied post-transaction.
I
Isolation — Concurrent transactions don't see each other's in-progress writes. Levels: READ COMMITTED (default PG) → REPEATABLE READ → SERIALIZABLE.
D
Durability — Committed transactions survive crashes. Guaranteed via WAL (write-ahead log) + fsync to disk before confirming commit.
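A minimal sketch of the $100 transfer from the Atomicity item, assuming a hypothetical accounts table; standard SQL transaction syntax:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- debit sender
UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- credit receiver
COMMIT;  -- both writes become durable together (WAL flushed before the commit returns)

-- If the credit fails (constraint violation, crash, timeout), roll back instead:
-- ROLLBACK;  -- the debit is undone too, so no partial transfer survives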
BASE
NOSQL / DISTRIBUTED · AVAILABLE BY DESIGN
B
Basically Available — System always responds, even if response is partial or stale. Rejects requests only in extreme failure, not during replication lag.
S
Soft State — System state can change over time even without new inputs. Replicas catching up, caches expiring, tombstones propagating.
E
Eventually Consistent — Given no new updates, all replicas converge to the same state. No guarantee on when. Reads may return stale data.
BASE is not "broken ACID" — it's a deliberate design choice.
Cassandra, DynamoDB, CouchDB are BASE by design.
The trade: give up strong consistency, gain availability + throughput
Isolation Levels Deep-Dive
| ISOLATION LEVEL | DIRTY READ | NON-REPEATABLE | PHANTOM READ | DEFAULT IN |
|---|---|---|---|---|
| READ UNCOMMITTED | Allowed | Allowed | Allowed | — |
| READ COMMITTED | Prevented | Allowed | Allowed | PostgreSQL |
| REPEATABLE READ | Prevented | Prevented | Allowed | MySQL InnoDB |
| SERIALIZABLE | Prevented | Prevented | Prevented | — (highest isolation) |
Interview pattern: Higher isolation level = fewer anomalies but more locking = lower throughput. Most applications run at READ COMMITTED. SERIALIZABLE is used for financial systems where phantom reads would cause incorrect calculations (e.g., calculating remaining inventory before inserting an order).
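A sketch of the inventory example at SERIALIZABLE, assuming PostgreSQL syntax and hypothetical table names:

BEGIN ISOLATION LEVEL SERIALIZABLE;

-- Read remaining inventory for a product...
SELECT SUM(quantity) FROM order_items WHERE product_id = 42;

-- ...then insert a new order line based on that count
INSERT INTO order_items (product_id, quantity) VALUES (42, 3);

COMMIT;
-- If a concurrent transaction inserted a conflicting row (a phantom at lower levels),
-- PostgreSQL aborts one of the transactions with a serialization failure; retry it.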
SQL vs NoSQL
Not a religion — a trade-off based on access patterns and consistency needs
| DIMENSION | SQL (Relational) | NoSQL (Distributed) |
|---|---|---|
| Data model | Tables, rows, strict schema, FK constraints | Documents, KV, wide-column, graph — flexible |
| Query language | SQL — JOIN, GROUP BY, window functions | Simple API: Get/Put/Scan or limited query DSL |
| Consistency | ACID transactions guaranteed | BASE; tunable per-operation (Cassandra) |
| Scaling | Vertical + read replicas; sharding is hard | Horizontal sharding built in (designed for it) |
| Use when | Complex queries, JOINs, transactions, stable schema | Known simple access patterns, massive scale, flexible schema |
| Examples | PostgreSQL, MySQL, Aurora, SQLite | DynamoDB, Cassandra, MongoDB, Redis, Neo4j |
The 5 NoSQL Data Models
Key-Value (Redis, DynamoDB) · Document (MongoDB, CouchDB, Firestore) · Wide-column (Cassandra) · Graph (Neo4j, Neptune) · Time-series (InfluxDB, TimescaleDB)
Replication
How to survive node failures and serve more reads
// PRIMARY-REPLICA (LEADER-FOLLOWER)

                                            ┌─async──→ REPLICA 1 ←── READ CLIENT
WRITE CLIENT ──→ PRIMARY (accepts writes) ──┤
                                            └─async──→ REPLICA 2 ←── READ CLIENT
✓ Scales reads horizontally (add more replicas)
✗ Replication lag: reads from replica may be stale
✗ Primary is still single write point
Replication Lag — Solutions
Strategies to handle stale reads
Problem: User updates profile → reads from replica → sees old data (lag ~100ms–2s)

Strategy 1: Read-your-own-writes
Route the user's reads to PRIMARY for their own data only.
How: track last_write_time per user; if recent → route to primary.
Cost: extra load on primary for the write author's reads.

Strategy 2: Monotonic Reads
Always route the same user to the same replica. Prevents the user seeing data "go backwards" (newer on one replica, older on the next).
How: Hash(userId) % numReplicas → sticky routing.

Strategy 3: Semi-Synchronous Replication
Primary waits for an ACK from at least 1 replica before confirming the write. Zero data loss on primary crash (at least 1 replica has the write).
Cost: write latency += 1 network RTT to the replica.

Strategy 4: Route critical paths to primary
Payment confirmation, inventory check → always read from primary. Profile photos, comment counts → can read from a replica.
Multi-Primary (Active-Active) Replication
| ASPECT | PRIMARY-REPLICA | MULTI-PRIMARY |
|---|---|---|
| Write acceptance | Only primary accepts writes | Any node accepts writes |
| Write scaling | Limited to primary capacity | Writes scale across all nodes |
| Conflict risk | None (single writer) | High — concurrent writes to same row |
| Conflict resolution | N/A | LWW timestamp, CRDTs, app-level merge |
| Use case | Read-heavy web apps | Geo-distributed active-active, Cassandra |
Sharding (Horizontal Partitioning)
Distributing rows across multiple DB nodes — when one machine isn't enough
The Hotkey Problem: A single shard key (Taylor Swift's user_id during a concert drop) gets 100× more traffic than others. Solutions: (1) Key suffixing — distribute post_id_0...post_id_9 across shards; (2) Cache in Redis; (3) Per-shard read replicas. Design your shard key to co-locate related data and distribute write load.
Cross-Shard Query Problem
Strategies for queries that span shards
Problem: SELECT COUNT(*) FROM orders GROUP BY restaurant_id
Orders are sharded by user_id → restaurant data is spread across all shards → must query ALL shards and merge results in the application layer.

Solution 1: Denormalise (NoSQL pattern)
Embed related data in the same document. The user document includes their recent orders → no cross-shard JOIN needed.

Solution 2: Co-locate by access pattern
Shard ALL of a user's data by user_id. "All my orders" stays on one shard, but "all orders for this restaurant" still requires scatter-gather.

Solution 3: Separate analytics store
Write events to Kafka → consume into a data warehouse (BigQuery, Redshift). Cross-tenant/cross-shard analytics run on the warehouse, not the OLTP DB.

Solution 4: Accept scatter-gather for rare queries
Fan out to all shards, merge in the application layer, cache the result aggressively.
Database Selection Guide
The right database for the right job — based on access patterns, not familiarity
Need ACID + complex JOINs?
PostgreSQL / MySQL / Aurora
Need massive write throughput (millions/sec)?
Cassandra / DynamoDB
Need flexible JSON schema per document?
MongoDB / CouchDB / Firestore
Need graph traversal (friends-of-friends)?
Neo4j / Amazon Neptune
Need time-series metrics + monitoring?
InfluxDB / TimescaleDB / Prometheus
Need full-text search + facets?
Elasticsearch / OpenSearch / Algolia
Need distributed SQL (geo-global)?
CockroachDB / Google Spanner
Need in-memory cache / pub-sub / leaderboard?
Redis / Memcached
Comparison Matrix
| DATABASE | TYPE | CONSISTENCY | SCALE | BEST FOR |
|---|---|---|---|---|
| PostgreSQL | Relational | ACID | Vertical + read replicas | Complex queries, transactions, analytics |
| MySQL / Aurora | Relational | ACID | Vertical + read replicas | Web apps, proven reliability, Aurora serverless |
| MongoDB | Document | Tunable | Horizontal sharding | Flexible schema, catalogs, user profiles |
| Cassandra | Wide-column | Eventual / Tunable | Horizontal (massive) | High-write IoT, time-series, messaging |
| DynamoDB | KV / Document | Eventual / Strong | Horizontal (managed) | Serverless, variable load, simple access patterns |
| Redis | Key-Value | Strong (single node) | Cluster mode | Cache, sessions, pub/sub, sorted sets |
| Elasticsearch | Search | Eventual | Horizontal | Full-text search, log analysis, facets |
| InfluxDB | Time-series | Eventual | Horizontal | Metrics, monitoring, IoT telemetry |
Interview Tips
| QUESTION | STRONG ANSWER |
|---|---|
| "SQL or NoSQL?" | Depends on access patterns. Complex JOINs + ACID → SQL. Known simple key lookups + horizontal scale → NoSQL. Tell me the access patterns first. |
| "How does B-tree indexing work?" | Balanced tree keeps keys sorted. Binary search on keys = O(log n). Range queries traverse adjacent leaves. Cost: every write updates all indexes. |
| "How to handle a hotkey?" | Cache in Redis (avoid DB entirely), key suffixing (distribute load across N keys), or per-shard read replicas for the hot shard. |
| "Primary vs replica reads?" | Replica reads are faster but may be stale by replication lag. Use primary for user's own writes (read-your-writes) and financial data. Replica is fine for other users' public data. |
| "ACID vs BASE?" | ACID: all-or-nothing transactions, strong consistency — right for banking, booking. BASE: available, eventually consistent — right for social likes, analytics. |
01
Index Design — Orders Table
CREATE TABLE orders (
id BIGINT PRIMARY KEY,
user_id INT,
restaurant_id INT,
status VARCHAR(20), -- 'pending', 'delivered', 'cancelled'
total DECIMAL(10,2),
created_at TIMESTAMP
);
Query patterns:
Q1: All orders for a specific user (most frequent)
Q2: Recent orders for a restaurant ordered by time
Q3: All pending orders (dashboard refresh every 5s)
Q4: Orders by user in a date range
For each query:
1. Design the optimal index (name columns + order)
2. Explain the B-tree traversal path
3. Identify if a covering index is possible
4. Explain the INSERT/UPDATE cost of your index choice
Bonus: Q3 — would a partial index help? Why?
02
Sharding Design — Multi-Tenant SaaS
10,000 business tenants, each with 10K–1M records. Most queries scoped to single tenant. 3 large tenants = 60% of traffic. Occasional cross-tenant analytics.
- What is the shard key? Why?
- Which sharding strategy: range, hash, or consistent hashing? Why?
- How do you handle the 3 hot tenants without letting them overload one shard?
- How do you serve cross-tenant analytics without scatter-gathering all shards?
- What happens when a tenant grows 10× — does your design need changes?
03
Replication Lag Decision Scenarios
Social media platform: primary DB for writes, 3 read replicas, replication lag 100ms–2s.
For each, decide: primary or replica? Justify with the consistency model needed.
- User views their own profile immediately after updating it
- User views another user's follower count
- User's home feed (posts from people they follow)
- Payment confirmation page after completing a purchase
- Admin dashboard showing total active users (refreshed every 5 min)
★
Design Instagram's Storage Layer
Design complete storage for Instagram. For each data type: DB choice, CAP position, sharding key, storage estimate.
- User accounts (500M users)
- Photos/Videos (100B media files, avg 3 MB)
- Comments (10B comments)
- Likes (500B likes — read-heavy, approximate count OK)
- Follow relationships (avg 500 followers/user)
- Feed (posts from people you follow — fan-out on write vs read)
Required: draw the storage architecture diagram, estimate total storage, identify the hardest consistency challenge.
MODULE B2 · DATABASES AT SCALE
B-tree indexing: O(log n), left-most prefix rule, range query support
Composite and covering indexes — when and how to design them
ACID — all 4 properties in depth, including isolation levels
BASE — deliberate trade-off, when it's the right choice
SQL vs NoSQL decision: based on access patterns, not preference
All 5 NoSQL data models: KV, Document, Wide-column, Graph, Time-series
Primary-replica replication + replication lag solutions (4 strategies)
Multi-primary replication — conflict resolution strategies
Range vs hash vs consistent hashing sharding — trade-offs of each
Hotkey problem — 3 solutions: cache, key suffixing, read replicas per shard
Cross-shard query problem and solutions (denorm, co-locate, analytics store)
Can pick the right DB for any scenario in the selection guide
✏️ Tasks 1–3 completed (index design, sharding, replication lag)
✏️ Capstone: Instagram storage layer — full design with estimates
// NEXT MODULE
B3 — Caching
Redis data structures · CDN · Cache-aside · Write-through · Write-back
Cache invalidation strategies · TTL vs eviction · Cache stampede
Read-through cache · Consistent hashing for distributed cache