A backend that is correct but slow is a bug. Phase 4 closes the gap between writing code that works and writing code that scales. The concepts here — event loops, lock-free atomics, epoll — are what separate a 100 req/s prototype from a 100,000 req/s production service.
Everything builds on Phase 0 (Linux syscalls) and Phase 2 (TCP/HTTP): you need to understand file descriptors and sockets before epoll makes sense, and you need HTTP to understand why C10K matters.
- Ph0 — Linux syscalls, file descriptors, process model
- Ph2 — TCP sockets, HTTP request/response lifecycle
- Basic C: pointers, structs, malloc/free
- M01 (networking stack) and M06 (HTTP internals) recommended
- Edge-triggered epoll event loop in C handling 10K+ conns
- Thread pool with mutex + condition variable work queue
- Lock-free MPSC queue using stdatomic.h CAS
- In-process LRU cache (hash map + doubly-linked list)
- Benchmark: thread-per-conn vs epoll event loop
| # | Concept | Tab | Key Insight |
|---|---|---|---|
| C1 | Threading models | Threading | Thread-per-req wastes 8KB stack × N; event loop reuses one thread |
| C2 | Synchronization primitives | Sync | Mutex = binary; semaphore = counting; condvar = signaling |
| C3 | Lock-free CAS / stdatomic.h | Lock-Free | Atomic read-modify-write without kernel involvement |
| C4 | I/O multiplexing evolution | I/O Mux | select→poll→epoll→io_uring; O(n)→O(1) notification |
| C5 | C10K problem | I/O Mux | OS scheduling kills thread-per-conn; epoll is the fix |
| C6 | In-process LRU/LFU caching | Caching | Hash map + doubly-linked list = O(1) get/put/evict |
| C7 | Cache stampede / Redis | Caching | Mutex lock or probabilistic early expiry prevents thundering herd |
| C8 | Connection pool management | Scaling | Pool exhaustion → queue with timeout + backpressure |
| C9 | Load balancing algorithms | Scaling | Consistent hashing minimizes redistribution on node change |
| C10 | Stateless horizontal scaling | Scaling | Externalize all state; make ops idempotent |
The threading model is the most consequential architectural decision in a concurrent server. It determines memory footprint, latency profile, CPU utilisation, and how hard the code is to reason about. There are four canonical models.
| Model | How It Works | Memory / 10K Conns | Best For | Real Examples |
|---|---|---|---|---|
| Thread-per-request | One OS thread per active connection; blocks on I/O | ~80 MB (8KB stack × 10K) | Low concurrency, simple CRUD | Apache prefork, early Java servlets |
| Thread pool | Fixed N threads; work queue; threads pick up tasks | ~800 KB (N=100 threads) | CPU-bound tasks, mixed workloads | Apache worker MPM, gRPC server |
| Event loop | Single thread; I/O multiplexing (epoll); non-blocking | ~kilobytes per conn | High-concurrency I/O-bound services | Nginx, Node.js, Redis |
| Green threads / M:N | Userspace scheduler maps M goroutines → N OS threads | ~2 KB stack (growable) | Mixed I/O+CPU; simple code | Go goroutines, Rust async/tokio, Erlang |
- Stack memory: Linux default 8 MB ulimit, minimum 4–8 KB actual; 10K threads = 40–80 MB minimum
- Context switching: Each OS thread switch costs ~1–10 µs (TLB flush, register save); 10K threads fighting for CPU = scheduling storm
- Blocking I/O: Thread sits idle during syscall (read, write, accept). CPU does nothing useful
- Memory bandwidth: Each sleeping thread's stack still occupies virtual memory — page faults on wake
- No blocking: All I/O is async; thread never waits
- One thread → thousands of connections: epoll delivers ready FDs in O(1)
- Cache-friendly: Single thread has hot L1/L2; no context switch overhead
- Predictable latency: No lock contention between connections
- Limitation: CPU-bound tasks block the loop — offload to thread pool
Best of both worlds: event loop accepts connections, but dispatches CPU-bound work to a thread pool. The key design decisions are queue depth (prevents unbounded memory growth), number of workers (usually CPU cores × 1–2), and what to do when the queue is full (reject with 503, or block with timeout).
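As a sketch of that hand-off (illustrative names only; this is not the module's ThreadPool code), a bounded queue protected by a mutex and condition variable might look like the following, with a full queue rejected immediately so the event loop can answer 503:

```c
#include <pthread.h>
#include <stdbool.h>

#define QUEUE_DEPTH 256                     /* bound: prevents unbounded memory growth */

typedef struct {
    void *items[QUEUE_DEPTH];
    int head, tail, count;
    pthread_mutex_t mu;
    pthread_cond_t not_empty;
} task_queue_t;

/* Event-loop side: enqueue CPU-bound work; reject (caller sends 503) if full. */
bool queue_push(task_queue_t *q, void *task) {
    pthread_mutex_lock(&q->mu);
    if (q->count == QUEUE_DEPTH) {          /* full: apply backpressure */
        pthread_mutex_unlock(&q->mu);
        return false;
    }
    q->items[q->tail] = task;
    q->tail = (q->tail + 1) % QUEUE_DEPTH;
    q->count++;
    pthread_cond_signal(&q->not_empty);     /* wake one worker */
    pthread_mutex_unlock(&q->mu);
    return true;
}

/* Worker side: block until a task is available. */
void *queue_pop(task_queue_t *q) {
    pthread_mutex_lock(&q->mu);
    while (q->count == 0)                   /* loop guards against spurious wakeups */
        pthread_cond_wait(&q->not_empty, &q->mu);
    void *task = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    pthread_mutex_unlock(&q->mu);
    return task;
}
```

Initialisation (PTHREAD_MUTEX_INITIALIZER, pthread_cond_init) and worker start-up are omitted; the predicate loop around pthread_cond_wait is the spurious-wakeup guard covered in the synchronization section below.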
Every synchronization primitive solves a specific problem. Using the wrong one either causes deadlock, starvation, or unnecessary serialization. Master this decision table before writing any concurrent code.
| Primitive | Semantics | pthreads API | Use When | Pitfall |
|---|---|---|---|---|
| Mutex | Binary lock — exclusive access to critical section | pthread_mutex_t | Protecting shared data structure (queue, hash map) | Lock ordering violations → deadlock |
| RW Lock | Multiple concurrent readers OR one exclusive writer | pthread_rwlock_t | Read-heavy workloads (config, caches, routing tables) | Writer starvation if readers never yield |
| Semaphore | Counting lock — allows N simultaneous holders | sem_t / sem_init | Rate limiting, connection pools, resource counting | Forgetting sem_post → deadlock; don't use as mutex |
| Condition Variable | Wait for a predicate; always paired with mutex | pthread_cond_t | Producer/consumer, work queues, thread pools | Spurious wakeups — always check predicate in loop |
| Spinlock | Busy-wait loop; no kernel involvement | pthread_spinlock_t | Very short critical sections on multi-core only | Wastes CPU if held for >~100 ns; never block inside |
Deadlock prevention: acquire locks in a fixed global order, or use pthread_mutex_trylock with backoff (if the second lock cannot be acquired, release the first, back off briefly, and retry).
Hold locks for the shortest time possible. Never do I/O, system calls, or expensive computation while holding a mutex. Extract the data you need under the lock, release, then process. This maximises throughput and minimises tail latency.
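A minimal sketch of that discipline, with illustrative names (conn_t, flush_response): the bytes are copied out under the mutex, and the blocking send() happens only after the lock is released.

```c
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>

/* Illustrative connection state protected by a mutex. */
typedef struct {
    pthread_mutex_t mu;
    char buf[4096];
    size_t len;
    int fd;
} conn_t;

void flush_response(conn_t *c) {
    char local[4096];
    size_t n;

    pthread_mutex_lock(&c->mu);
    n = c->len;
    memcpy(local, c->buf, n);          /* extract what is needed under the lock */
    c->len = 0;
    pthread_mutex_unlock(&c->mu);      /* release before any syscall */

    send(c->fd, local, n, 0);          /* blocking I/O outside the critical section */
}
```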
| Anti-Pattern | Fix |
|---|---|
| Call send() while holding mutex | Copy data out, release lock, then send |
| malloc inside critical section | Pre-allocate or allocate before locking |
| Nested locks without ordering | Define global lock hierarchy by module/enum |
| Spinlock on single-core machine | Use mutex — spinning just wastes quanta |
Lock-free algorithms allow multiple threads to make progress without mutual exclusion. The foundation is Compare-And-Swap (CAS) — an atomic operation that reads, compares, and conditionally writes in a single uninterruptible hardware instruction.
The operation atomically reads a memory location, compares it against an expected value, and writes a new value only if they match, reporting whether the swap happened.
Retry loops (optimistic concurrency) are used when CAS fails — re-read the current value and try again. Progress is guaranteed for at least one thread in each round.
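In C11 this is atomic_compare_exchange_weak/strong from stdatomic.h. A minimal sketch of the retry-loop pattern follows (a real counter would just use atomic_fetch_add; the explicit loop is shown because the same shape carries over to pointer swaps in lock-free stacks and queues):

```c
#include <stdatomic.h>

_Atomic long counter;

/* Lock-free increment built from a CAS retry loop (optimistic concurrency). */
void lockfree_increment(void) {
    long expected = atomic_load(&counter);
    /* compare_exchange writes expected + 1 only if counter still equals expected;
       on failure it reloads the current value into expected, and we retry. */
    while (!atomic_compare_exchange_weak(&counter, &expected, expected + 1)) {
        /* another thread won this round; retry with the fresh value */
    }
}
```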
CAS checks value equality, not identity. If a pointer changes A → B → A between your read and your CAS, the CAS succeeds even though the world has changed underneath you — leading to use-after-free or silent data corruption in lock-free linked lists and queues.
| Problem | Fix |
|---|---|
| Pointer A freed and reallocated at same address | Double-width CAS: combine pointer + version counter (128-bit CAS on x86-64) |
| Node popped from stack, pushed back | Tagged pointer — store generation counter in low bits (alignment gives free bits) |
| Safe memory reclamation | Hazard pointers: threads publish pointers they are reading; reclamation checks the hazard list before freeing |
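A rough sketch of the tagged-pointer row above, under the assumption that nodes are 64-byte aligned (aligned_alloc) so the low six pointer bits can carry a generation counter. The names (stack_head, stack_push, stack_pop) are illustrative, and safe memory reclamation remains a separate problem:

```c
#include <stdatomic.h>
#include <stdint.h>

#define TAG_MASK ((uintptr_t)0x3F)    /* 64-byte alignment leaves 6 spare low bits */

typedef struct node {                 /* allocate with aligned_alloc(64, sizeof(node_t)) */
    struct node *next;
    int value;
} node_t;

static node_t *untag(uintptr_t p)              { return (node_t *)(p & ~TAG_MASK); }
static uintptr_t tag(node_t *p, uintptr_t gen) { return (uintptr_t)p | (gen & TAG_MASK); }

static _Atomic uintptr_t stack_head;  /* tagged pointer: top of a Treiber stack */

void stack_push(node_t *n) {
    uintptr_t old = atomic_load(&stack_head);
    for (;;) {
        n->next = untag(old);
        uintptr_t desired = tag(n, (old & TAG_MASK) + 1);   /* every swap bumps the tag */
        if (atomic_compare_exchange_weak(&stack_head, &old, desired))
            return;
    }
}

node_t *stack_pop(void) {
    uintptr_t old = atomic_load(&stack_head);
    for (;;) {
        node_t *top = untag(old);
        if (!top)
            return NULL;
        /* An A->B->A sequence on the raw pointer now leaves a different tag,
           so a CAS against a stale snapshot correctly fails. */
        uintptr_t desired = tag(top->next, (old & TAG_MASK) + 1);
        if (atomic_compare_exchange_weak(&stack_head, &old, desired))
            return top;               /* old is refreshed on failure; the loop retries */
    }
}
```

The six-bit counter wraps quickly, and dereferencing top->next is only safe if popped nodes are not freed while another thread may still hold a reference, which is exactly the gap that hazard pointers or epoch-based reclamation close.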
Multi-producer single-consumer (MPSC) queues are the most practical lock-free structure for server internals. Producers CAS on the tail; the single consumer pops without a lock (no CAS is needed on the head because only one thread ever advances it).
Use memory_order_relaxed for statistics counters (no ordering needed). Use acquire/release pairs for producer-consumer handoffs. Use seq_cst only when you need a global total order — it is the slowest (full fence on x86, heavyweight on ARM).
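To make both points concrete, here is a minimal bounded MPSC ring sketch using stdatomic.h. It is a teaching sketch, not a production queue, and every name in it (mpsc_t, mpsc_push, mpsc_pop) is illustrative: producers claim a slot index with a relaxed CAS on the tail, write the payload, then flip a per-slot flag with a release store; the consumer reads the flag with acquire, so the payload write is guaranteed visible.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QSIZE 1024                            /* power of two, fixed capacity */

typedef struct {
    void *slot[QSIZE];
    _Atomic bool ready[QSIZE];                /* per-slot "published" flag */
    _Atomic size_t tail;                      /* claimed by producers via CAS */
    _Atomic size_t head;                      /* advanced only by the single consumer */
} mpsc_t;

/* Producer (many threads): claim a slot with a CAS retry loop, then publish. */
bool mpsc_push(mpsc_t *q, void *item) {
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    for (;;) {
        size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
        if (t - h >= QSIZE)
            return false;                     /* full; caller applies backpressure */
        if (atomic_compare_exchange_weak_explicit(&q->tail, &t, t + 1,
                memory_order_relaxed, memory_order_relaxed))
            break;                            /* slot t now belongs to this producer */
    }
    q->slot[t % QSIZE] = item;
    atomic_store_explicit(&q->ready[t % QSIZE], true,
                          memory_order_release);  /* pairs with the consumer's acquire */
    return true;
}

/* Consumer (exactly one thread): no CAS on the head, as noted above. */
void *mpsc_pop(mpsc_t *q) {
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    if (!atomic_load_explicit(&q->ready[h % QSIZE], memory_order_acquire))
        return NULL;                          /* nothing published at this slot yet */
    void *item = q->slot[h % QSIZE];          /* acquire makes the producer's write visible */
    atomic_store_explicit(&q->ready[h % QSIZE], false, memory_order_relaxed);
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return item;
}
```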
I/O multiplexing lets a single thread wait on multiple file descriptors simultaneously, being notified only when one is ready — eliminating the need for a thread per connection. The Linux kernel has evolved four generations of this interface over 30 years.
| API | Year | Notification Model | Max FDs | Complexity | Verdict |
|---|---|---|---|---|---|
| select | 1983 | Bitmask scan on return | 1024 (FD_SETSIZE) | O(n) per call | Legacy only |
| poll | 1986 | Linear scan of pollfd array | Unlimited | O(n) per call | Portable but slow |
| epoll | 2002 | Kernel event queue; ready list | Unlimited | O(1) add/wait | Linux standard |
| io_uring | 2019 | Ring buffers; zero syscall per I/O | Unlimited | O(1) + amortised | Linux 5.1+, future |
| Mode | When kernel notifies | Read behaviour |
|---|---|---|
| LT (default) | Every time FD is readable | Can read partial — notified again |
| ET (EPOLLET) | Only on state transition (new data arrives) | Must read until EAGAIN — otherwise miss data |
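A skeleton of the edge-triggered loop, assuming the listening socket and every accepted socket have already been put into non-blocking mode with fcntl(O_NONBLOCK); error handling is trimmed to the essentials:

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <errno.h>

#define MAX_EVENTS 1024

void event_loop(int epfd, int listen_fd) {
    struct epoll_event ev, events[MAX_EVENTS];

    ev.events = EPOLLIN | EPOLLET;                 /* edge-triggered */
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {                 /* ET: accept until EAGAIN */
                int c;
                while ((c = accept(listen_fd, NULL, NULL)) >= 0) {
                    /* set c non-blocking here (fcntl), then register it */
                    ev.events = EPOLLIN | EPOLLET;
                    ev.data.fd = c;
                    epoll_ctl(epfd, EPOLL_CTL_ADD, c, &ev);
                }
            } else {                               /* ET: read until EAGAIN */
                char buf[4096];
                ssize_t r;
                while ((r = read(fd, buf, sizeof buf)) > 0) {
                    /* handle r bytes (echo / parse) */
                }
                if (r == 0 || (r < 0 && errno != EAGAIN))
                    close(fd);                     /* peer closed or real error */
            }
        }
    }
}
```

With level-triggered mode the inner drain loops could be dropped, at the cost of extra wakeups per connection.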
Linux 5.1 (2019) introduced io_uring, built around two lock-free ring buffers shared between kernel and userspace:
| Feature | epoll | io_uring |
|---|---|---|
| Syscall per I/O | Yes (read/write + epoll_wait) | No (ring buffer, batched) |
| Zero-copy | No | Yes (fixed buffers) |
| Async file I/O | No (files always "ready") | Yes (true async) |
| Portability | Linux only | Linux 5.1+ only |
Dan Kegel's 1999 paper posed the question: how do you handle 10,000 simultaneous connections on a single server? At the time, the dominant model was one thread per connection. The math:
- 10K threads × 8 KB stack = 80 MB minimum (but pages are faulted in, TLB pressure)
- Context switch cost: ~1–10 µs each; 10K threads × 100 context switches/sec = 1–10 seconds of CPU time per second just switching
- Kernel scheduler: O(log n) for CFS scheduler; 10K sleeping threads compete for wake-up slots
Solution: epoll event loop. One thread. No context switches between connections. O(1) delivery of ready events. Nginx handles 50K–100K concurrent connections on a single core with this model.
Caching is the single most impactful performance optimisation available to a backend engineer. The key insight: most production workloads exhibit a power-law access pattern — 20% of keys account for 80% of requests. Keeping that hot 20% in memory eliminates most database round-trips.
The classic O(1) LRU implementation combines:
- Hash map: O(1) lookup — key → pointer to list node
- Doubly-linked list: O(1) insert/remove; MRU at head, LRU at tail
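A condensed sketch of that combination (not the module's full C implementation; the hash-map helpers map_get/map_put/map_del are assumed and not shown):

```c
#include <stdlib.h>

/* Doubly-linked list node: MRU at head, LRU at tail. */
typedef struct lru_node {
    int key, value;
    struct lru_node *prev, *next;
} lru_node_t;

typedef struct {
    lru_node_t *head, *tail;     /* list in recency order */
    int size, capacity;
    /* plus a hash map: key -> lru_node_t* */
} lru_cache_t;

/* Assumed O(1) hash-map helpers (implementation not shown in this sketch). */
lru_node_t *map_get(lru_cache_t *c, int key);
void        map_put(lru_cache_t *c, int key, lru_node_t *node);
void        map_del(lru_cache_t *c, int key);

static void list_unlink(lru_cache_t *c, lru_node_t *n) {
    if (n->prev) n->prev->next = n->next; else c->head = n->next;
    if (n->next) n->next->prev = n->prev; else c->tail = n->prev;
}

static void list_push_front(lru_cache_t *c, lru_node_t *n) {
    n->prev = NULL;
    n->next = c->head;
    if (c->head) c->head->prev = n; else c->tail = n;
    c->head = n;
}

/* O(1) get: hash lookup, then move the node to the front. */
int lru_get(lru_cache_t *c, int key, int *value_out) {
    lru_node_t *n = map_get(c, key);
    if (!n) return 0;                        /* miss */
    list_unlink(c, n);
    list_push_front(c, n);
    *value_out = n->value;
    return 1;
}

/* O(1) put: update in place, or insert and evict the tail when at capacity. */
void lru_put(lru_cache_t *c, int key, int value) {
    lru_node_t *n = map_get(c, key);
    if (n) { n->value = value; list_unlink(c, n); list_push_front(c, n); return; }
    if (c->size == c->capacity) {            /* evict least recently used */
        lru_node_t *lru = c->tail;
        list_unlink(c, lru);
        map_del(c, lru->key);
        free(lru);
        c->size--;
    }
    n = malloc(sizeof *n);
    n->key = key; n->value = value;
    list_push_front(c, n);
    map_put(c, key, n);
    c->size++;
}
```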
LFU evicts the key accessed the fewest times. Better than LRU for workloads where old-but-popular keys should stay (e.g. homepage, top products).
Implementation: min-heap keyed by frequency counter + hash map. On each access, increment freq and re-heapify; O(log n) per operation. Replacing the heap with frequency buckets (one doubly-linked list per frequency count) brings this down to O(1).
| Metric | Formula / Target |
|---|---|
| Hit rate | cache_hits / (hits + misses) → target >90% |
| Working set size | unique_hot_keys × avg_value_bytes |
| Memory budget | JVM/process heap × 30% (leave room for GC, buffers) |
| TTL vs eviction | TTL for correctness; capacity eviction for memory |
When a popular key's TTL expires simultaneously for N threads/processes, they all miss the cache and hammer the database together. The database sees a sudden N× spike in traffic. Three mitigations:
- Mutex lock on cache miss: First miss acquires lock and refreshes; other waiters read the freshly populated value. Problem: lock adds latency for all waiters.
- Probabilistic early expiry (XFetch): Randomly refresh before TTL expires — probability increases as expiry approaches. Zero extra infrastructure, no lock. Formula: if current_time - ttl * β * log(rand()) > expiry_time → refresh (a C sketch follows this list).
- Background refresh: Serve stale value immediately; async worker refreshes in background. Requires stale-while-revalidate semantics — acceptable when slight staleness is OK.
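A compact sketch of the XFetch check (cache_entry_t and xfetch_should_refresh are illustrative names). The published algorithm uses delta, the observed recomputation time, where the formula above substitutes the TTL; using delta makes early refreshes rarer:

```c
#include <math.h>      /* link with -lm */
#include <stdlib.h>
#include <stdbool.h>
#include <time.h>

/* Illustrative cache entry: absolute expiry time plus how long the last
   recomputation took (delta, in seconds). */
typedef struct {
    double expiry;
    double delta;
} cache_entry_t;

/* Returns true if this reader should refresh early. beta = 1.0 is the usual
   default; >1 refreshes earlier, <1 later. */
bool xfetch_should_refresh(const cache_entry_t *e, double beta) {
    double now = (double)time(NULL);
    double r = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* uniform in (0,1), never 0 */
    /* -delta * beta * log(r) is a positive random offset: the closer now is to
       expiry, the more likely the sum crosses it and triggers a refresh. */
    return now - e->delta * beta * log(r) >= e->expiry;
}
```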
A single Redis key receiving millions of reads/sec overloads one shard. Solutions:
- Local in-process replica: Read from Redis once, cache in process for 100 ms. Stale but radically cheaper.
- Key splitting: Store hotkey:0 through hotkey:N-1; a reader picks hotkey:{random 0..N-1}. Write to all N. Reads spread across N slots/shards.
- Read replicas: Route reads to Redis replicas, writes to primary. Redis Cluster supports this natively.
Scaling means keeping latency flat as traffic grows. The three levers are: pooling (amortise connection setup cost), load balancing (distribute traffic optimally), and stateless design (allow trivial horizontal replication).
Opening a TCP connection + TLS handshake + PostgreSQL auth takes 2–50 ms. A connection pool amortises this by keeping a set of pre-established connections ready to use. Key parameters:
| Parameter | Meaning | Typical Value |
|---|---|---|
| pool_size | Max simultaneous connections | 10–50 per service instance |
| min_idle | Keep-warm connections | 2–5 |
| acquire_timeout | Wait before returning error | 500 ms – 2 s |
| max_lifetime | Force recycle (leak prevention) | 30 min |
| idle_timeout | Close idle conns above min_idle | 5–10 min |
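A sketch of the acquire path with acquire_timeout, using pthread_cond_timedwait (pool_t and pool_acquire are illustrative names; connection creation, health checks, and release are omitted):

```c
#include <pthread.h>
#include <errno.h>
#include <time.h>
#include <stddef.h>

typedef struct {
    void **idle;                  /* stack of idle, pre-established connections */
    int n_idle;
    pthread_mutex_t mu;
    pthread_cond_t available;
} pool_t;

/* Block up to timeout_ms for an idle connection. NULL means the caller should
   fail fast (e.g. return 503) instead of queueing without bound: backpressure. */
void *pool_acquire(pool_t *p, int timeout_ms) {
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) { deadline.tv_sec++; deadline.tv_nsec -= 1000000000L; }

    pthread_mutex_lock(&p->mu);
    while (p->n_idle == 0) {
        if (pthread_cond_timedwait(&p->available, &p->mu, &deadline) == ETIMEDOUT) {
            pthread_mutex_unlock(&p->mu);
            return NULL;          /* acquire_timeout hit */
        }
    }
    void *conn = p->idle[--p->n_idle];
    pthread_mutex_unlock(&p->mu);
    return conn;
}
```

A matching pool_release would push the connection back onto the idle stack and pthread_cond_signal(&p->available).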
pgBouncer operates at the proxy level — thousands of app connections multiplex onto a smaller PostgreSQL connection pool (transaction-mode pooling). Reduces PostgreSQL's backend process count from thousands to tens.
| Algorithm | How | Best For | Avoid When |
|---|---|---|---|
| Round Robin | Request 1 → backend 1, request 2 → backend 2, … | Homogeneous backends, uniform request duration | Variable request times (some backends pile up) |
| Weighted Round Robin | Backend with weight 2 gets 2× requests | Heterogeneous hardware (some servers are more powerful) | Dynamic load changes |
| Least Connections | Always route to backend with fewest active connections | Variable request durations (DB queries, uploads) | Requires tracking active count — adds state to LB |
| IP Hash | hash(client_ip) mod N | Session stickiness (non-Redis session stores) | Horizontal scaling changes N → all sessions remapped |
| Consistent Hashing | Virtual nodes on a ring; hash(key) → nearest clockwise node | Caching layers (Redis cluster), CDN routing | Very small node counts (variance in distribution) |
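The core of the consistent-hashing lookup is small enough to sketch here; the hands-on exercise later in this module builds the full ring (add/remove, fnv1a hashing). The names below mirror that exercise, but only the lookup half is shown:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t hash;        /* position of a virtual node on the ring */
    int      node_id;     /* which physical node owns it */
} vnode_t;

typedef struct {
    vnode_t *vnodes;      /* kept sorted by hash */
    size_t   count;
} ring_t;

/* Find the owning node: the first virtual node clockwise of hash(key),
   wrapping to index 0 when the key hashes past the last entry. */
int ring_get_node(const ring_t *ring, uint32_t key_hash) {
    size_t lo = 0, hi = ring->count;      /* binary search: first vnode >= key_hash */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ring->vnodes[mid].hash < key_hash)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == ring->count)
        lo = 0;                           /* wrap around the ring */
    return ring->vnodes[lo].node_id;
}
```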
The golden rule: no instance-local state. If you can kill any instance and restart it on a different host without losing data, the service is stateless and horizontally scalable.
| State Type | Externalise To | Note |
|---|---|---|
| User sessions | Redis (with TTL) | Session ID in cookie; data in Redis |
| File uploads / blobs | S3 / object storage | Never write to local disk in stateless service |
| Rate limit counters | Redis INCR + EXPIRE | Sliding window or token bucket in Redis |
| Distributed locks | Redis SET NX EX / Redlock | For leader election or single-writer guarantees |
| Job queues | Redis Streams, SQS, Kafka | Workers are stateless consumers |
Idempotency keys make retries safe: the client sends an idempotency_key (UUID) that is stored in the DB with a unique constraint. On retry, the server detects the duplicate key and returns the original result without re-processing.
Compile everything with gcc -std=c11 -O2 -lpthread; the epoll server requires Linux ≥ 2.6.27. Build three versions of an echo server and compare throughput and latency under 10K concurrent connections using wrk or ab.
- server_threaded.c: accept loop spawns pthread_create for each connection. Use pthread_detach to avoid join overhead.
- server_pool.c: single accept thread enqueues connections to the ThreadPool from this module. Pool has POOL_SIZE=8 workers.
- server_epoll.c: single-threaded epoll event loop, edge-triggered, non-blocking. Handles reads in the same loop.
- Load-test each with wrk -t4 -c10000 -d30s http://localhost:8080/. Record req/sec, latency p50/p99, and CPU utilisation (pidstat).
- Note where the threaded server starts to fail (e.g. ENOMEM, or scheduler thrash visible in htop).

Measure the performance difference between a mutex-protected counter and an atomic counter under high contention.
- bench_counter.c: two versions of a global counter incremented 10M times across 16 threads.
- Version 1: pthread_mutex_t wrapping counter++.
- Version 2: atomic_fetch_add(&counter, 1, memory_order_relaxed).
- Build with gcc -O2 -std=c11 bench_counter.c -lpthread -o bench. Run 5× and average. Expected: atomic 3–10× faster.

Extend the LRU cache from the C Implementation tab to be thread-safe, then measure hit rate under a Zipf-distributed workload (80/20 rule).
- Add a pthread_rwlock_t to the LRUCache struct. Protect lru_get with rdlock and lru_put with wrlock.
- Generate keys with rand() % (n * zipf_skew) so that lower keys are accessed more often. Use 1000 unique keys.
- Profile with perf stat: compare cache-miss events at different capacities. Observe L1/L2 hit rates alongside LRU hit rates.

Implement a consistent hash ring in C and measure key redistribution when adding/removing nodes.
- Represent the ring as a sorted array of {hash_value, node_id} pairs. Use 150 virtual nodes per physical node (improves balance). Hash function: fnv1a(node_name + "_vnode_" + i).
- ring_add_node(ring, "node1") and ring_remove_node(ring, "node1") — insert/delete 150 entries, keeping the array sorted (insertion sort or qsort).
- ring_get_node(ring, key): hash the key, binary-search the sorted array for the nearest clockwise entry.

- Can explain thread-per-request memory/scheduling cost at 10K connections
- Can implement a bounded thread pool with mutex + condvar work queue in C
- Understand event loop model: single thread, non-blocking I/O, epoll notification
- Know when to use green threads (Go/Rust) vs explicit thread pools
- Can choose between mutex, RW lock, semaphore, condvar for a given use case
- Know the deadlock recipe and how to prevent it with lock ordering
- Always wrap pthread_cond_wait in a while loop (spurious wakeup guard)
- Understand spinlock trade-offs: only for sub-100ns critical sections on SMP
- Can explain CAS semantics: atomically read, compare, conditionally swap
- Know the ABA problem and two mitigations (tagged pointers, hazard pointers)
- Can use stdatomic.h: atomic_load/store/fetch_add/compare_exchange
- Understand memory ordering: relaxed vs acquire/release vs seq_cst
- Know select/poll/epoll/io_uring evolution and O(n)→O(1) improvement
- Can implement an epoll edge-triggered event loop in C from scratch
- Understand ET vs LT: ET requires reading until EAGAIN, LT re-notifies
- Know io_uring ring buffer design and why it reduces syscall overhead
- Can explain why 10K threads fail (stack, context switch cost)
- Can implement LRU cache with O(1) get/put using doubly-linked list + hash map
- Understand LFU vs LRU trade-offs for different access distributions
- Can explain cache stampede and implement mutex-lock or XFetch mitigation
- Know Redis hotspot key sharding strategies (key splitting, local replica)
- Know pgBouncer transaction-mode pooling and when pool exhaustion triggers
- Can compare round-robin, least-conn, and consistent hashing for a use case
- Understand consistent hashing: O(K/N) key redistribution vs O(K) for modulo
- Can design a stateless service: externalise sessions, use idempotency keys