Phase 4 · Module 11
Concurrency & Performance
Threading models · Synchronization · Lock-free · epoll/io_uring · Caching · Load Balancing
pthreads · epoll · io_uring · stdatomic.h · Redis · pgBouncer · C / Linux
⚡ Why This Phase Matters

A backend that is correct but slow is a bug. Phase 4 closes the gap between writing code that works and writing code that scales. The concepts here — event loops, lock-free atomics, epoll — are what separate a 100 req/s prototype from a 100,000 req/s production service.

Everything builds on Phase 0 (Linux syscalls) and Phase 2 (TCP/HTTP): you need to understand file descriptors and sockets before epoll makes sense, and you need HTTP to understand why C10K matters.

🔗 Prerequisites
  • Ph0 — Linux syscalls, file descriptors, process model
  • Ph2 — TCP sockets, HTTP request/response lifecycle
  • Basic C: pointers, structs, malloc/free
  • M01 (networking stack) and M06 (HTTP internals) recommended
🎯 What You Will Build
  • Edge-triggered epoll event loop in C handling 10K+ conns
  • Thread pool with mutex + condition variable work queue
  • Lock-free MPSC queue using stdatomic.h CAS
  • In-process LRU cache (hash map + doubly-linked list)
  • Benchmark: thread-per-conn vs epoll event loop
📋 Module Roadmap — 10 Concepts
#   | Concept                      | Tab       | Key Insight
C1  | Threading models             | Threading | Thread-per-req wastes 8 KB stack × N; event loop reuses one thread
C2  | Synchronization primitives   | Sync      | Mutex = binary; semaphore = counting; condvar = signaling
C3  | Lock-free CAS / stdatomic.h  | Lock-Free | Atomic read-modify-write without kernel involvement
C4  | I/O multiplexing evolution   | I/O Mux   | select → poll → epoll → io_uring; O(n) → O(1) notification
C5  | C10K problem                 | I/O Mux   | OS scheduling kills thread-per-conn; epoll is the fix
C6  | In-process LRU/LFU caching   | Caching   | Hash map + doubly-linked list = O(1) get/put/evict
C7  | Cache stampede / Redis       | Caching   | Mutex lock or probabilistic early expiry prevents thundering herd
C8  | Connection pool management   | Scaling   | Pool exhaustion → queue with timeout + backpressure
C9  | Load balancing algorithms    | Scaling   | Consistent hashing minimizes redistribution on node change
C10 | Stateless horizontal scaling | Scaling   | Externalize all state; make ops idempotent
🧵 C1 — Four Threading Models Every Backend Engineer Must Know

The threading model is the most consequential architectural decision in a concurrent server. It determines memory footprint, latency profile, CPU utilisation, and how hard the code is to reason about. There are four canonical models.

Model               | How It Works                                            | Memory / 10K Conns        | Best For                            | Real Examples
Thread-per-request  | One OS thread per active connection; blocks on I/O      | ~80 MB (8 KB stack × 10K) | Low concurrency, simple CRUD        | Apache prefork, early Java servlets
Thread pool         | Fixed N threads; work queue; threads pick up tasks      | ~800 KB (N = 100 threads) | CPU-bound tasks, mixed workloads    | Apache worker MPM, gRPC server
Event loop          | Single thread; I/O multiplexing (epoll); non-blocking   | ~kilobytes per conn       | High-concurrency I/O-bound services | Nginx, Node.js, Redis
Green threads / M:N | Userspace scheduler maps M goroutines → N OS threads    | ~2 KB stack (growable)    | Mixed I/O + CPU; simple code        | Go goroutines, Rust async/tokio, Erlang
⚠️ Why Thread-per-Request Breaks at Scale
  • Stack memory: Linux reserves 8 MB of virtual stack per thread by default (ulimit -s), with 4–8 KB actually resident; 10K threads = 40–80 MB of touched stack pages at minimum
  • Context switching: Each OS thread switch costs ~1–10 µs (TLB flush, register save); 10K threads fighting for CPU = scheduling storm
  • Blocking I/O: Thread sits idle during syscall (read, write, accept). CPU does nothing useful
  • Memory bandwidth: Each sleeping thread's stack still occupies virtual memory — page faults on wake
✅ Event Loop Model Advantages
  • No blocking: All I/O is async; thread never waits
  • One thread → thousands of connections: epoll delivers ready FDs in O(1)
  • Cache-friendly: Single thread has hot L1/L2; no context switch overhead
  • Predictable latency: No lock contention between connections
  • Limitation: CPU-bound tasks block the loop — offload to thread pool
🔩 Thread Pool Pattern — Bounded Work Queue

Best of both worlds: event loop accepts connections, but dispatches CPU-bound work to a thread pool. The key design decisions are queue depth (prevents unbounded memory growth), number of workers (usually CPU cores × 1–2), and what to do when the queue is full (reject with 503, or block with timeout).

        Incoming Requests
               │
               ▼
       [ Accept Thread ]  ──── epoll event loop
               │
               ▼  enqueue task
     [ Bounded Work Queue ]  capacity = 1024
               │
   ┌───────┬───┴────┬─────────┐
   ▼       ▼        ▼         ▼
Worker 1  Worker 2  Worker 3 ... Worker N
(thread)  (thread)  (thread)     (thread)

 queue full? → 503 Service Unavailable (backpressure)
Analogy — Restaurant Kitchen: Thread-per-request = hire a new chef for every diner (chaos, expensive). Thread pool = hire 8 chefs, queue orders (Apache worker MPM). Event loop = one chef who never waits — checks oven, assembles plate, checks next order, never blocks (Nginx). Green threads = chef clones who can pause mid-task and swap (Go goroutines).
🔧 pthreads Quick Reference
#include <pthread.h>               /* link with -lpthread */

/* Create a thread */
pthread_t tid;
pthread_create(&tid, NULL, worker_fn, arg);
pthread_join(tid, NULL);           /* block until done */
pthread_detach(tid);               /* auto-cleanup on exit */

/* Thread function signature */
void *worker_fn(void *arg) {
    int *val = (int *)arg;
    /* ... do work ... */
    return NULL;
}
🔐 C2 — Synchronization Primitives: When to Use Each

Every synchronization primitive solves a specific problem. Using the wrong one either causes deadlock, starvation, or unnecessary serialization. Master this decision table before writing any concurrent code.

Primitive          | Semantics                                            | pthreads API       | Use When                                               | Pitfall
Mutex              | Binary lock — exclusive access to critical section   | pthread_mutex_t    | Protecting shared data structure (queue, hash map)     | Lock ordering violations → deadlock
RW Lock            | Multiple concurrent readers OR one exclusive writer  | pthread_rwlock_t   | Read-heavy workloads (config, caches, routing tables)  | Writer starvation if readers never yield
Semaphore          | Counting lock — allows N simultaneous holders        | sem_t / sem_init   | Rate limiting, connection pools, resource counting     | Forgetting sem_post → deadlock; don't use as mutex
Condition Variable | Wait for a predicate; always paired with mutex       | pthread_cond_t     | Producer/consumer, work queues, thread pools           | Spurious wakeups — always check predicate in loop
Spinlock           | Busy-wait loop; no kernel involvement                | pthread_spinlock_t | Very short critical sections on multi-core only        | Wastes CPU if held for >~100 ns; never block inside
🔒 Mutex Pattern — Work Queue
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
int queue_size = 0;

/* Producer */
pthread_mutex_lock(&lock);
enqueue(item);
queue_size++;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&lock);

/* Consumer — always loop on the predicate */
pthread_mutex_lock(&lock);
while (queue_size == 0)                /* spurious wakeup guard */
    pthread_cond_wait(&cond, &lock);
item = dequeue();
queue_size--;
pthread_mutex_unlock(&lock);
📖 RW Lock Pattern — Config Cache
pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;

/* Multiple threads reading simultaneously */
pthread_rwlock_rdlock(&rwlock);
val = config_get("max_conns");
pthread_rwlock_unlock(&rwlock);

/* One thread writing (reloads config) */
pthread_rwlock_wrlock(&rwlock);
config_reload();                       /* exclusive */
pthread_rwlock_unlock(&rwlock);
🔢 Semaphore Pattern — Connection Pool Rate Limiter
#include <semaphore.h>

sem_t pool_sem;
sem_init(&pool_sem, 0, 20);            /* allow 20 concurrent DB conns */

/* Acquire a slot — blocks if pool is exhausted */
sem_wait(&pool_sem);                   /* decrement; block if 0 */
conn = pool_acquire();
do_query(conn);
pool_release(conn);
sem_post(&pool_sem);                   /* increment; wake a waiter */

/* Non-blocking try */
if (sem_trywait(&pool_sem) == -1) {
    /* pool full → return 503 immediately */
}
Deadlock Recipe: Lock A then B in thread 1; lock B then A in thread 2. Prevent with a global lock ordering (always acquire in same order by address or enum), or use pthread_mutex_trylock with backoff.
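A minimal sketch of the address-ordering fix; Account and transfer are illustrative names, not from any library:

#include <pthread.h>

typedef struct { pthread_mutex_t lock; long balance; } Account;

/* Address-order locking: every thread acquires the two mutexes in the
   same global order (lowest address first), so AB/BA deadlock is impossible */
void transfer(Account *a, Account *b, long amount) {
    pthread_mutex_t *first  = (a < b) ? &a->lock : &b->lock;
    pthread_mutex_t *second = (a < b) ? &b->lock : &a->lock;
    pthread_mutex_lock(first);
    pthread_mutex_lock(second);
    a->balance -= amount;
    b->balance += amount;
    pthread_mutex_unlock(second);
    pthread_mutex_unlock(first);
}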
📐 Lock Granularity Rule

Hold locks for the shortest time possible. Never do I/O, system calls, or expensive computation while holding a mutex. Extract the data you need under the lock, release, then process. This maximises throughput and minimises tail latency.

Anti-Pattern                     | Fix
Call send() while holding mutex  | Copy data out, release lock, then send
malloc inside critical section   | Pre-allocate or allocate before locking
Nested locks without ordering    | Define global lock hierarchy by module/enum
Spinlock on single-core machine  | Use mutex — spinning just wastes quanta
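A sketch of the first fix (copy data out under the lock, release, then send); the buffer names are illustrative:

#include <pthread.h>
#include <string.h>
#include <sys/socket.h>

static pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER;
static char   pending_buf[4096];
static size_t pending_len;

void flush_to_socket(int fd) {
    char local[4096];
    pthread_mutex_lock(&buf_lock);
    size_t n = pending_len;            /* extract data under the lock */
    memcpy(local, pending_buf, n);
    pending_len = 0;
    pthread_mutex_unlock(&buf_lock);   /* release before the syscall */
    if (n > 0)
        send(fd, local, n, 0);         /* slow I/O runs lock-free */
}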
⚛️ C3 — Lock-Free Programming with CAS and stdatomic.h

Lock-free algorithms allow multiple threads to make progress without mutual exclusion. The foundation is Compare-And-Swap (CAS) — an atomic operation that reads, compares, and conditionally writes in a single uninterruptible hardware instruction.

🔑 CAS Semantics

The operation atomically performs:

// Conceptual model — real C11 CAS takes `expected` by pointer so the
// caller learns the current value on failure:
bool CAS(int *ptr, int *expected, int desired) {
    if (*ptr == *expected) {
        *ptr = desired;
        return true;       // success: we made the swap
    }
    *expected = *ptr;      // failure: report the value we actually saw
    return false;          // caller retries with the updated expected
}
// Hardware instruction: LOCK CMPXCHG on x86
// One atomic instruction; no kernel involvement; no mutex needed

Retry loops (optimistic concurrency) are used when CAS fails — re-read the current value and try again. Progress is guaranteed for at least one thread in each round.
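A minimal retry-loop sketch: a lock-free stack push (push alone is ABA-safe; pop is where ABA bites, as the next section shows):

#include <stdatomic.h>

typedef struct SNode { struct SNode *next; void *data; } SNode;

void stack_push(_Atomic(SNode *) *top, SNode *n) {
    SNode *old = atomic_load_explicit(top, memory_order_relaxed);
    do {
        n->next = old;                         /* link to the current top */
    } while (!atomic_compare_exchange_weak_explicit(
                 top, &old, n,                 /* old is refreshed on failure */
                 memory_order_release, memory_order_relaxed));
}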

📦 C11 stdatomic.h
#include <stdatomic.h>

atomic_int counter = ATOMIC_VAR_INIT(0);

/* Atomic increment — replaces a mutex for counters */
atomic_fetch_add(&counter, 1);

/* Load / store with explicit memory order */
int val = atomic_load_explicit(&counter, memory_order_acquire);
atomic_store_explicit(&counter, 42, memory_order_release);

/* CAS — strong (no spurious failures) */
int expected = 0;
bool ok = atomic_compare_exchange_strong(&counter, &expected, 1);
🔧 GCC __atomic Builtins (Pre-C11)
/* GCC atomic builtins — widely supported */
int old = __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);

int expected = 5, desired = 6;
bool ok = __atomic_compare_exchange_n(
    &counter, &expected, desired,
    false,                  /* strong */
    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);

/* Memory orders (weakest to strongest):
   RELAXED → ACQUIRE/RELEASE → SEQ_CST
   Use ACQUIRE for loads, RELEASE for stores */
🐛 The ABA Problem

CAS checks value equality not identity. If a pointer changes A → B → A between your read and your CAS, the CAS succeeds even though the world has changed underneath you — leading to use-after-free or silent data corruption in lock-free linked lists and queues.

Problem                                            | Fix
Pointer A freed and reallocated at same address    | Double-width CAS: combine pointer + version counter (128-bit CAS on x86-64)
Node popped from stack, then pushed back           | Tagged pointer — store generation counter in low bits (alignment gives free bits)
Freeing a node other threads may still be reading  | Hazard pointers — threads publish pointers they are reading; reclamation checks the hazard list before freeing
/* Tagged pointer — 64-bit pointer + 64-bit generation counter */
typedef struct { uintptr_t ptr; uint64_t gen; } TaggedPtr;

_Alignas(16) _Atomic TaggedPtr head;   /* 16 bytes → needs 128-bit CAS */
/* x86-64: CMPXCHG16B (compile with gcc -mcx16, or use __int128) */
📨 Lock-Free MPSC Queue (Practical Pattern)

Multi-producer single-consumer queues are the most practical lock-free structure for server internals. Producers atomically swap themselves in at the tail; the single consumer pops without a lock (no CAS is needed on head, because only one thread ever reads it).

typedef struct Node {
    void *data;
    _Atomic(struct Node *) next;
} Node;

typedef struct {
    _Atomic(Node *) head;    /* consumer reads  */
    _Atomic(Node *) tail;    /* producers swap  */
} MPSCQueue;

void mpsc_push(MPSCQueue *q, Node *n) {
    atomic_store_explicit(&n->next, NULL, memory_order_relaxed);
    /* swap ourselves in as the new tail */
    Node *prev = atomic_exchange_explicit(&q->tail, n, memory_order_acq_rel);
    /* link — at most one writer per prev slot */
    atomic_store_explicit(&prev->next, n, memory_order_release);
}
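The matching consumer side, a sketch assuming the queue is initialized with a single dummy (stub) node so that head = tail = stub:

#include <stdlib.h>    /* free */

void *mpsc_pop(MPSCQueue *q) {
    Node *head = atomic_load_explicit(&q->head, memory_order_relaxed);
    Node *next = atomic_load_explicit(&head->next, memory_order_acquire);
    if (next == NULL)
        return NULL;                   /* empty — or a producer is mid-push */
    void *data = next->data;           /* next becomes the new stub */
    atomic_store_explicit(&q->head, next, memory_order_relaxed);
    free(head);                        /* safe: only the consumer touches head */
    return data;
}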
Memory Ordering Rules of Thumb: Use memory_order_relaxed for statistics counters (no ordering needed). Use acquire/release pairs for producer-consumer handoffs. Use seq_cst only when you need a global total order — it is the slowest (full fence on x86, heavyweight on ARM).
🔀 C4 + C5 — I/O Multiplexing Evolution & the C10K Problem

I/O multiplexing lets a single thread wait on multiple file descriptors simultaneously, being notified only when one is ready — eliminating the need for a thread per connection. The Linux kernel has evolved four generations of this interface over 30 years.

API      | Year | Notification Model                 | Max FDs           | Complexity       | Verdict
select   | 1983 | Bitmask scan on return             | 1024 (FD_SETSIZE) | O(n) per call    | Legacy only
poll     | 1986 | Linear scan of pollfd array        | Unlimited         | O(n) per call    | Portable but slow
epoll    | 2002 | Kernel event queue; ready list     | Unlimited         | O(1) add/wait    | Linux standard
io_uring | 2019 | Ring buffers; zero syscall per I/O | Unlimited         | O(1) + amortised | Linux 5.1+, the future
🔵 epoll Deep Dive — the Three-Syscall API
#include <sys/epoll.h>

/* 1. Create epoll instance — returns an epoll fd */
int epfd = epoll_create1(EPOLL_CLOEXEC);

/* 2. Register / modify / remove FDs */
struct epoll_event ev;
ev.events  = EPOLLIN | EPOLLET;        /* EPOLLET = edge-triggered */
ev.data.fd = client_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);
/* EPOLL_CTL_MOD to change, EPOLL_CTL_DEL to remove */

/* 3. Wait — returns only ready FDs */
struct epoll_event events[128];
int n = epoll_wait(epfd, events, 128, -1);   /* -1 = block forever */
for (int i = 0; i < n; i++) {
    handle(events[i].data.fd, events[i].events);
}
Edge-Triggered (ET) vs Level-Triggered (LT)
Mode         | When kernel notifies                        | Read behaviour
LT (default) | Every time FD is readable                   | Can read partial — notified again
ET (EPOLLET) | Only on state transition (new data arrives) | Must read until EAGAIN — otherwise miss data
ET is higher performance (fewer wakeups) but demands non-blocking FDs and looping reads. LT is safer for beginners.
Non-Blocking FD Setup
static void set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* ET read loop — drain until EAGAIN */
while (1) {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n == -1 && errno == EAGAIN) break;   /* drained */
    if (n <= 0) { close(fd); break; }        /* error or EOF */
    process(buf, n);
}
🚀 io_uring — The Next Generation

Linux 5.1 (2019) introduced io_uring, built around two lock-free ring buffers shared between kernel and userspace:

 Userspace                                Kernel
 SQ (Submission Queue)  ──────────▶  kernel picks up SQEs
   └─ SQE: op=READ, fd=5, buf=…      and executes them async
 CQ (Completion Queue)  ◀──────────  kernel writes CQEs
   └─ CQE: user_data=…, res=42

No syscall is needed per operation while the SQ ring has space;
io_uring_enter() batches multiple submissions into one syscall, and
io_uring_setup() + mmap() set up the rings.
Feature         | epoll                                  | io_uring
Syscall per I/O | Yes (read/write + epoll_wait)          | No (ring buffer, batched)
Zero-copy       | No                                     | Yes (fixed buffers)
Async file I/O  | No (epoll cannot watch regular files)  | Yes (true async)
Portability     | Linux only                             | Linux 5.1+ only
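A minimal liburing read, sketched assuming liburing is installed (link with -luring); the file path is arbitrary:

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);        /* SQ/CQ rings with 8 entries */

    int fd = open("/etc/hostname", O_RDONLY);
    char buf[256];

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);  /* queue async read */
    io_uring_submit(&ring);                  /* one syscall for the batch */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);          /* block until completion */
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);           /* mark CQE consumed */

    io_uring_queue_exit(&ring);
    return 0;
}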
💣 C5 — The C10K Problem Explained

Dan Kegel's 1999 essay posed the question: how do you handle 10,000 simultaneous connections on a single server? At the time, the dominant model was one thread per connection. The math:

  • 10K threads × 8 KB stack = 80 MB minimum (but pages are faulted in, TLB pressure)
  • Context switch cost: ~1–10 µs each; 10K threads × 100 switches/sec each = 1M switches/sec, i.e. 1–10 seconds of CPU time per second spent doing nothing but switching
  • Kernel scheduler: O(log n) for CFS scheduler; 10K sleeping threads compete for wake-up slots

Solution: epoll event loop. One thread. No context switches between connections. O(1) delivery of ready events. Nginx handles 50K–100K concurrent connections on a single core with this model.

🗃️ C6 + C7 — In-Process Caching and Distributed Caching

Caching is the single most impactful performance optimisation available to a backend engineer. The key insight: most production workloads exhibit a power-law access pattern — 20% of keys account for 80% of requests. Keeping that hot 20% in memory eliminates most database round-trips.

🔗 C6 — LRU Cache: Hash Map + Doubly-Linked List

The classic O(1) LRU implementation combines:

  • Hash map: O(1) lookup — key → pointer to list node
  • Doubly-linked list: O(1) insert/remove; MRU at head, LRU at tail
GET "user:42" → hash_map["user:42"] → Node* ptrdoubly-linked list ▼ HEAD ←→ [user:99] ←→ [user:42] ←→ [user:01] ←→ TAIL MRU ←────────────────────────────────────── LRU After GET hit: move user:42 to HEAD HEAD ←→ [user:42] ←→ [user:99] ←→ [user:01] ←→ TAIL On capacity: evict TAIL node (user:01), remove from map
typedef struct Node {
    char key[64];
    void *value;
    struct Node *prev, *next;
} Node;

typedef struct {
    Node *head, *tail;       /* sentinel nodes */
    int size, capacity;
    /* hash_map: key → Node* (uthash or open addressing) */
} LRUCache;

/* get: hash lookup → move to head → return value */
/* put: hash lookup → if hit, update + move to head;
        if miss, insert at head; if full, evict tail */
📊 LFU — Least Frequently Used

Evicts the key accessed the fewest times. Better than LRU for workloads where old-but-popular keys should stay (e.g. homepage, top products).

Implementation: a min-heap keyed by frequency counter plus a hash map: increment the frequency and re-heapify on each access, O(log n) per operation. Replacing the heap with one doubly-linked list per frequency bucket brings this down to O(1).

📏 Cache Capacity Planning
Metric           | Formula / Target
Hit rate         | cache_hits / (hits + misses) → target > 90%
Working set size | unique_hot_keys × avg_value_bytes
Memory budget    | ~30% of process/JVM heap (leave room for GC, buffers)
TTL vs eviction  | TTL for correctness; capacity eviction for memory
⚡ C7 — Cache Stampede (Thundering Herd)

When a popular key's TTL expires simultaneously for N threads/processes, they all miss the cache and hammer the database together. The database sees a sudden N× spike in traffic. Three mitigations:

  • Mutex lock on cache miss: First miss acquires a lock and refreshes; the other waiters read the freshly populated value. Problem: the lock adds latency for every waiter.
  • Probabilistic early expiry (XFetch): Each reader randomly refreshes before the TTL expires, with probability rising as expiry approaches. Zero extra infrastructure, no lock. Rule: refresh when now − Δ·β·ln(rand()) ≥ expiry, where Δ is the recompute cost, β ≈ 1, and rand() ∈ (0,1). (Sketch after the code below.)
  • Background refresh: Serve the stale value immediately; an async worker refreshes in the background. Requires stale-while-revalidate semantics — acceptable when slight staleness is OK.
/* Redis SETNX-based mutex for stampede prevention
   (redis_get / redis_setnx / redis_set / redis_del are client wrappers) */
char *cache_get_with_lock(const char *key) {
    char *val = redis_get(key);
    if (val) return val;

    char lock_key[128];
    snprintf(lock_key, sizeof(lock_key), "lock:%s", key);

    if (redis_setnx(lock_key, "1", /* ttl */ 5)) {
        val = db_fetch(key);               /* we hold the lock: refresh */
        redis_set(key, val, TTL);
        redis_del(lock_key);
    } else {
        /* Another thread is refreshing — wait briefly and retry */
        usleep(5000);
        val = redis_get(key);
    }
    return val;
}
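A sketch of the XFetch rule from the second bullet; should_refresh, delta, and beta are illustrative names, with delta the measured recompute cost in seconds:

#include <math.h>
#include <stdbool.h>
#include <stdlib.h>

bool should_refresh(double now, double expiry, double delta, double beta) {
    double r = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  /* r in (0,1) */
    /* ln(r) < 0, so the left side creeps past expiry early,
       more aggressively the closer we are to expiry */
    return now - delta * beta * log(r) >= expiry;
}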
🔥 Redis Hotspot Key Sharding

A single Redis key receiving millions of reads/sec overloads one shard. Solutions:

  • Local in-process replica: Read from Redis once, cache in process for 100 ms. Stale but radically cheaper.
  • Key splitting: Store hotkey:0 through hotkey:N-1; reader picks hotkey:{random 0..N-1}. Write to all N. Reads spread across N slots/shards.
  • Read replicas: Route reads to Redis replicas, writes to primary. Redis Cluster supports this natively.
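A sketch of key splitting, reusing the hypothetical redis_get/redis_set wrappers from the stampede example; SPLITS and the key format are illustrative:

#define SPLITS 8    /* N copies of the hot key, spread across slots */

char *hotkey_get(const char *key) {
    char k[128];
    snprintf(k, sizeof(k), "%s:%d", key, rand() % SPLITS);  /* random copy */
    return redis_get(k);
}

void hotkey_set(const char *key, const char *val) {
    char k[128];
    for (int i = 0; i < SPLITS; i++) {       /* write-fanout to all N copies */
        snprintf(k, sizeof(k), "%s:%d", key, i);
        redis_set(k, val, TTL);
    }
}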
📈 C8–C10 — Connection Pools, Load Balancing & Horizontal Scaling

Scaling means keeping latency flat as traffic grows. The three levers are: pooling (amortise connection setup cost), load balancing (distribute traffic optimally), and stateless design (allow trivial horizontal replication).

🔌 C8 — Connection Pool Management

Opening a TCP connection + TLS handshake + PostgreSQL auth takes 2–50 ms. A connection pool amortises this by keeping a set of pre-established connections ready to use. Key parameters:

Parameter       | Meaning                         | Typical Value
pool_size       | Max simultaneous connections    | 10–50 per service instance
min_idle        | Keep-warm connections           | 2–5
acquire_timeout | Wait before returning error     | 500 ms – 2 s
max_lifetime    | Force recycle (leak prevention) | 30 min
idle_timeout    | Close idle conns above min_idle | 5–10 min
Pool Exhaustion: When all connections are busy and a new request arrives, options are: queue (wait up to acquire_timeout), reject immediately (return 503), or open a temporary overflow connection (be careful — can overwhelm the DB). Always emit a metric when queuing occurs; sustained queuing means the pool is undersized.
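A sketch of the queue-with-timeout option, building on the C2 semaphore pool; per POSIX, sem_timedwait measures against CLOCK_REALTIME:

#include <semaphore.h>
#include <time.h>
#include <errno.h>

/* Returns 0 on success, -1 on timeout (caller emits a metric + 503) */
int pool_acquire_timed(sem_t *sem, int timeout_ms) {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec  += timeout_ms / 1000;
    ts.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) { ts.tv_sec++; ts.tv_nsec -= 1000000000L; }

    while (sem_timedwait(sem, &ts) == -1) {
        if (errno == EINTR) continue;        /* retry if interrupted by signal */
        return -1;                           /* ETIMEDOUT */
    }
    return 0;
}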

pgBouncer operates at the proxy level — thousands of app connections multiplex onto a smaller PostgreSQL connection pool (transaction-mode pooling). Reduces PostgreSQL's backend process count from thousands to tens.

⚖️ C9 — Load Balancing Algorithms
Algorithm            | How                                                          | Best For                                          | Avoid When
Round Robin          | Request 1 → backend 1, request 2 → backend 2, …              | Homogeneous backends, uniform request duration    | Variable request times (some backends pile up)
Weighted Round Robin | Backend with weight 2 gets 2× requests                       | Heterogeneous hardware (some servers more powerful) | Dynamic load changes
Least Connections    | Route to the backend with fewest active connections          | Variable request durations (DB queries, uploads)  | Tracking active counts adds state to the LB
IP Hash              | hash(client_ip) mod N                                        | Session stickiness (non-Redis session stores)     | Horizontal scaling changes N → all sessions remapped
Consistent Hashing   | Virtual nodes on a ring; hash(key) → nearest clockwise node  | Caching layers (Redis cluster), CDN routing       | Very small node counts (variance in distribution)
Consistent Hashing Ring (3 nodes, 6 virtual nodes):

  A₁ = 60°    B₁ = 120°   C₁ = 180°
  A₂ = 240°   B₂ = 300°   C₂ = 320°

hash("user:42") = 180° → nearest clockwise vnode = C₁

When node C is removed, C₁'s keys fall to A₂ and C₂'s wrap to A₁:
only C's keys move (to A); keys on A and B are unaffected.
Traditional modulo: ALL keys remap when N changes.
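A sketch of the ring lookup (Lab 4 builds the full ring); VNode and the uint32_t hash space are assumptions of this sketch:

#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t hash; int node_id; } VNode;

/* ring: array of virtual nodes sorted by hash */
int ring_get_node(const VNode *ring, size_t n, uint32_t key_hash) {
    size_t lo = 0, hi = n;             /* first vnode with hash >= key_hash */
    while (lo < hi) {
        size_t mid = (lo + hi) / 2;
        if (ring[mid].hash < key_hash) lo = mid + 1;
        else hi = mid;
    }
    return ring[lo % n].node_id;       /* lo == n wraps past 0° to index 0 */
}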
🌐 C10 — Stateless Horizontal Scaling

The golden rule: no instance-local state. If you can kill any instance and restart it on a different host without losing data, the service is stateless and horizontally scalable.

State Type           | Externalise To             | Note
User sessions        | Redis (with TTL)           | Session ID in cookie; data in Redis
File uploads / blobs | S3 / object storage        | Never write to local disk in a stateless service
Rate limit counters  | Redis INCR + EXPIRE        | Sliding window or token bucket in Redis
Distributed locks    | Redis SET NX EX / Redlock  | For leader election or single-writer guarantees
Job queues           | Redis Streams, SQS, Kafka  | Workers are stateless consumers
Idempotency Pattern: Every mutating operation should be idempotent — safe to retry without double-effects. Use a client-generated idempotency_key (UUID) stored in the DB with a unique constraint. On retry, the server detects the duplicate key and returns the original result without re-processing.
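A sketch of the pattern; Request, Response, and the db_* helpers are hypothetical wrappers around a table with a UNIQUE constraint on idempotency_key:

typedef struct { char idempotency_key[37]; /* UUID + NUL */ } Request;
typedef struct { int status; /* ... */ } Response;

Response handle_payment(const Request *req) {
    Response cached;
    if (db_lookup_by_idem_key(req->idempotency_key, &cached))
        return cached;             /* retry detected: replay stored result */

    Response r = process_payment(req);
    /* UNIQUE(idempotency_key) makes a concurrent duplicate insert fail;
       on conflict, re-read and return the winner's stored result */
    db_store_result(req->idempotency_key, &r);
    return r;
}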
🔧 Complete C Implementations
All examples compile on Linux with gcc -std=c11 -O2 -lpthread. The epoll server requires Linux ≥ 2.6.27.
1️⃣ Thread Pool with Work Queue
#include <pthread.h>
#include <stdlib.h>

#define POOL_SIZE 8
#define QUEUE_CAP 1024

typedef void (*Task)(void *);
typedef struct { Task fn; void *arg; } WorkItem;

typedef struct {
    WorkItem queue[QUEUE_CAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
    pthread_cond_t not_full;
    int shutdown;
    pthread_t threads[POOL_SIZE];
} ThreadPool;

static void *worker(void *arg) {
    ThreadPool *p = arg;
    while (1) {
        pthread_mutex_lock(&p->lock);
        while (p->count == 0 && !p->shutdown)
            pthread_cond_wait(&p->not_empty, &p->lock);
        if (p->shutdown && p->count == 0) {
            pthread_mutex_unlock(&p->lock);
            break;
        }
        WorkItem w = p->queue[p->head];
        p->head = (p->head + 1) % QUEUE_CAP;
        p->count--;
        pthread_cond_signal(&p->not_full);
        pthread_mutex_unlock(&p->lock);
        w.fn(w.arg);                       /* run the task outside the lock */
    }
    return NULL;
}

ThreadPool *pool_create(void) {
    ThreadPool *p = calloc(1, sizeof(*p));
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->not_empty, NULL);
    pthread_cond_init(&p->not_full, NULL);
    for (int i = 0; i < POOL_SIZE; i++)
        pthread_create(&p->threads[i], NULL, worker, p);
    return p;
}

int pool_submit(ThreadPool *p, Task fn, void *arg) {
    pthread_mutex_lock(&p->lock);
    while (p->count == QUEUE_CAP)          /* backpressure */
        pthread_cond_wait(&p->not_full, &p->lock);
    p->queue[p->tail] = (WorkItem){fn, arg};
    p->tail = (p->tail + 1) % QUEUE_CAP;
    p->count++;
    pthread_cond_signal(&p->not_empty);
    pthread_mutex_unlock(&p->lock);
    return 0;
}
2️⃣ epoll Edge-Triggered Event Loop
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>

#define MAX_EVENTS 256
#define PORT 8080

static void set_nonblocking(int fd) {
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
}

static void handle_client(int epfd, int cfd) {
    char buf[4096];
    while (1) {                               /* drain until EAGAIN (ET) */
        ssize_t n = read(cfd, buf, sizeof(buf));
        if (n == -1) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) break;
            perror("read"); close(cfd); return;
        }
        if (n == 0) { close(cfd); return; }   /* EOF */
        write(cfd, buf, n);                   /* echo */
    }
}

int main(void) {
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int opt = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_addr.s_addr = INADDR_ANY,
        .sin_port = htons(PORT)
    };
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, SOMAXCONN);
    set_nonblocking(lfd);

    int epfd = epoll_create1(EPOLL_CLOEXEC);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == lfd) {          /* new connection */
                int cfd = accept(lfd, NULL, NULL);
                set_nonblocking(cfd);
                ev.events  = EPOLLIN | EPOLLET;      /* edge-triggered */
                ev.data.fd = cfd;
                epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &ev);
            } else {
                handle_client(epfd, events[i].data.fd);
            }
        }
    }
    return 0;
}
3️⃣ Atomic Reference Counter (stdatomic.h)
#include <stdatomic.h>
#include <stdio.h>      /* snprintf */
#include <stdlib.h>

typedef struct {
    atomic_int refcount;
    char data[256];
} SharedBuffer;

SharedBuffer *buf_new(const char *src) {
    SharedBuffer *b = malloc(sizeof(*b));
    atomic_init(&b->refcount, 1);
    snprintf(b->data, sizeof(b->data), "%s", src);
    return b;
}

SharedBuffer *buf_retain(SharedBuffer *b) {
    atomic_fetch_add_explicit(&b->refcount, 1, memory_order_relaxed);
    return b;
}

void buf_release(SharedBuffer *b) {
    /* Release decrement pairs with the acquire fence in the final releaser */
    if (atomic_fetch_sub_explicit(&b->refcount, 1,
                                  memory_order_release) == 1) {
        atomic_thread_fence(memory_order_acquire);
        free(b);
    }
}

/* Lock-free CAS retry loop */
static atomic_int global_seq = 0;

int increment_if_even(void) {
    int cur = atomic_load_explicit(&global_seq, memory_order_relaxed);
    do {
        if (cur % 2 != 0) return 0;        /* odd — give up */
    } while (!atomic_compare_exchange_weak_explicit(
                 &global_seq, &cur, cur + 1,
                 memory_order_acq_rel, memory_order_relaxed));
    return 1;
}
4️⃣ LRU Cache — Hash Map + Doubly-Linked List
#include <stdlib.h>
#include <string.h>

#define CACHE_CAP 128
#define HT_SIZE   257                /* prime */

typedef struct LRUNode {
    char key[64];
    long value;
    struct LRUNode *prev, *next;
    struct LRUNode *ht_next;         /* hash table chaining */
} LRUNode;

typedef struct {
    LRUNode *ht[HT_SIZE];
    LRUNode head, tail;              /* sentinels */
    int size, cap;
} LRUCache;

static unsigned hash_key(const char *k) {
    unsigned h = 5381;
    while (*k) h = h * 33 ^ (unsigned char)*k++;
    return h % HT_SIZE;
}

static void list_remove(LRUNode *n) {
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

static void list_push_front(LRUCache *c, LRUNode *n) {
    n->next = c->head.next;
    n->prev = &c->head;
    c->head.next->prev = n;
    c->head.next = n;
}

void lru_init(LRUCache *c, int cap) {
    memset(c, 0, sizeof(*c));
    c->cap = cap;
    c->head.next = &c->tail;
    c->tail.prev = &c->head;
}

long lru_get(LRUCache *c, const char *key) {
    for (LRUNode *n = c->ht[hash_key(key)]; n; n = n->ht_next)
        if (strcmp(n->key, key) == 0) {
            list_remove(n);              /* promote to MRU */
            list_push_front(c, n);
            return n->value;
        }
    return -1;                           /* miss */
}

void lru_put(LRUCache *c, const char *key, long val) {
    /* Hit: update in place and promote */
    for (LRUNode *n = c->ht[hash_key(key)]; n; n = n->ht_next)
        if (strcmp(n->key, key) == 0) {
            n->value = val;
            list_remove(n);
            list_push_front(c, n);
            return;
        }
    /* Miss: evict LRU if at capacity */
    if (c->size == c->cap) {
        LRUNode *lru = c->tail.prev;
        list_remove(lru);
        unsigned h = hash_key(lru->key); /* unlink from hash chain */
        LRUNode **pp = &c->ht[h];
        while (*pp != lru) pp = &(*pp)->ht_next;
        *pp = lru->ht_next;
        free(lru);
        c->size--;
    }
    LRUNode *n = calloc(1, sizeof(*n));
    strncpy(n->key, key, sizeof(n->key) - 1);
    n->value = val;
    list_push_front(c, n);
    unsigned h = hash_key(key);
    n->ht_next = c->ht[h];
    c->ht[h] = n;
    c->size++;
}
🧪 Lab 1 — Thread Pool Benchmark: Thread-per-Request vs Pool vs Event Loop

Build three versions of an echo server and compare throughput and latency under 10K concurrent connections using wrk or ab.

1. Write server_threaded.c: the accept loop spawns pthread_create for each connection. Use pthread_detach to avoid join overhead.
2. Write server_pool.c: a single accept thread enqueues connections to the ThreadPool from this module. Pool has POOL_SIZE=8 workers.
3. Write server_epoll.c: single-threaded epoll event loop, edge-triggered, non-blocking. Handles reads in the same loop.
4. Stress test: wrk -t4 -c10000 -d30s http://localhost:8080/. Record req/sec, latency p50/p99, and CPU utilisation (pidstat).
5. Increase the connection count to 50K. Observe where thread-per-request fails (ENOMEM, or scheduler thrash visible in htop).
6. Expected result: epoll handles 50K at ~1% CPU (mostly idle); the threaded server crashes or hits the ulimit around 4–8K threads depending on stack size.
🧪 Lab 2 — Lock-Free Counter vs Mutex Counter Micro-Benchmark

Measure the performance difference between a mutex-protected counter and an atomic counter under high contention.

1. Write bench_counter.c: two versions of a global counter incremented 10M times across 16 threads.
2. Version A: pthread_mutex_t wrapping counter++.
3. Version B: atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed).
4. Compile: gcc -O2 -std=c11 bench_counter.c -lpthread -o bench. Run 5× and average. Expected: atomic 3–10× faster.
5. Add a third version: per-thread counters (no sharing), summed at the end — see the sketch after this lab. This shows the theoretical maximum by eliminating cache-line bouncing entirely.
6. Explain the result in terms of cache coherence traffic: each atomic increment on a shared cache line triggers an MESI invalidation broadcast to all cores holding the line.
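A sketch of the padded per-thread counters for step 5; the 64-byte line size is an assumption (true of most x86-64 CPUs):

#define NTHREADS 16

/* One counter per cache line: threads never share a line, so there is
   no MESI invalidation traffic (no false sharing) */
struct PaddedCounter {
    _Alignas(64) long count;
};
static struct PaddedCounter counters[NTHREADS];

/* worker i does counters[i].count++ privately;
   main() sums all slots after pthread_join */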
🧪 Lab 3 — In-Process LRU Cache with Thread Safety

Extend the LRU cache from the C Implementation tab to be thread-safe, then measure hit rate under a Zipf-distributed workload (80/20 rule).

1. Add a pthread_rwlock_t to the LRUCache struct. Protect lru_get with rdlock and lru_put with wrlock. (Caveat: this lru_get mutates the list to promote the node, so a plain rdlock is only safe if you accept approximate LRU order or take wrlock on a hit.)
2. Write an approximate Zipf key generator so that low-numbered keys are accessed far more often (the 80/20 skew), over ~1000 unique keys — see the sketch after this lab.
3. Spawn 8 threads, each performing 1M get/put operations. Count hits and misses with atomic counters.
4. Vary cache capacity (32, 64, 128, 256 entries). Plot hit rate vs capacity. Observe the "knee" where doubling capacity no longer meaningfully improves hit rate.
5. Profile with perf stat: compare cache-miss events at different capacities. Observe hardware L1/L2 hit rates alongside LRU hit rates.
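A sketch of an inverse-CDF Zipf sampler for step 2; zipf_init and zipf_next are illustrative names, and s = 1.0 gives the classic heavy skew:

#include <math.h>
#include <stdlib.h>

#define NKEYS 1000
static double cdf[NKEYS];

void zipf_init(double s) {           /* precompute cumulative distribution */
    double norm = 0, acc = 0;
    for (int i = 0; i < NKEYS; i++) norm += 1.0 / pow(i + 1, s);
    for (int i = 0; i < NKEYS; i++) {
        acc += 1.0 / pow(i + 1, s) / norm;
        cdf[i] = acc;                /* cdf[NKEYS-1] == 1.0 */
    }
}

int zipf_next(unsigned *seed) {      /* binary-search the CDF; thread-safe */
    double u = (double)rand_r(seed) / RAND_MAX;
    int lo = 0, hi = NKEYS - 1;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (cdf[mid] < u) lo = mid + 1; else hi = mid;
    }
    return lo;                       /* key 0 is the hottest */
}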
🧪 Lab 4 — Consistent Hashing Implementation

Implement a consistent hash ring in C and measure key redistribution when adding/removing nodes.

1. Represent the ring as a sorted array of {hash_value, node_id} pairs. Use 150 virtual nodes per physical node (improves balance). Hash function: fnv1a(node_name + "_vnode_" + i).
2. Implement ring_add_node(ring, "node1") and ring_remove_node(ring, "node1") — insert/delete 150 entries, keeping the array sorted (insertion sort or qsort).
3. Implement ring_get_node(ring, key): hash the key, binary-search the sorted array for the nearest clockwise entry (as in the C9 sketch).
4. Generate 100K random keys. Assign each to a 3-node ring. Add a 4th node. Count how many keys moved to the new node. Expected: ~25% (100K / 4). Verify.
5. Compare with modulo hashing: the same 100K keys on 3 → 4 nodes. Count remapped keys. Expected: ~75% remapped (traditional modulo). Consistent hashing wins.
✅ Phase 4 Completion Checklist
Threading Models
  • Can explain thread-per-request memory/scheduling cost at 10K connections
  • Can implement a bounded thread pool with mutex + condvar work queue in C
  • Understand event loop model: single thread, non-blocking I/O, epoll notification
  • Know when to use green threads (Go/Rust) vs explicit thread pools
Synchronization Primitives
  • Can choose between mutex, RW lock, semaphore, condvar for a given use case
  • Know the deadlock recipe and how to prevent it with lock ordering
  • Always wrap pthread_cond_wait in a while loop (spurious wakeup guard)
  • Understand spinlock trade-offs: only for sub-100ns critical sections on SMP
Lock-Free Programming
  • Can explain CAS semantics: atomically read, compare, conditionally swap
  • Know the ABA problem and two mitigations (tagged pointers, hazard pointers)
  • Can use stdatomic.h: atomic_load/store/fetch_add/compare_exchange
  • Understand memory ordering: relaxed vs acquire/release vs seq_cst
I/O Multiplexing
  • Know select/poll/epoll/io_uring evolution and O(n)→O(1) improvement
  • Can implement an epoll edge-triggered event loop in C from scratch
  • Understand ET vs LT: ET requires reading until EAGAIN, LT re-notifies
  • Know io_uring ring buffer design and why it reduces syscall overhead
  • Can explain why 10K threads fail (stack, context switch cost)
Caching
  • Can implement LRU cache with O(1) get/put using doubly-linked list + hash map
  • Understand LFU vs LRU trade-offs for different access distributions
  • Can explain cache stampede and implement mutex-lock or XFetch mitigation
  • Know Redis hotspot key sharding strategies (key splitting, local replica)
Scaling
  • Know pgBouncer transaction-mode pooling and when pool exhaustion triggers
  • Can compare round-robin, least-conn, and consistent hashing for a use case
  • Understand consistent hashing: O(K/N) key redistribution vs O(K) for modulo
  • Can design a stateless service: externalise sessions, use idempotency keys