A backend that is correct but slow is a bug. Phase 4 closes the gap between writing code that works and writing code that scales. The concepts here — event loops, lock-free atomics, epoll — are what separate a 100 req/s prototype from a 100,000 req/s production service.
Everything builds on Phase 0 (Linux syscalls) and Phase 2 (TCP/HTTP): you need to understand file descriptors and sockets before epoll makes sense, and you need HTTP to understand why C10K matters.
- Ph0 — Linux syscalls, file descriptors, process model
- Ph2 — TCP sockets, HTTP request/response lifecycle
- Basic C: pointers, structs, malloc/free
- M01 (networking stack) and M06 (HTTP internals) recommended
- Edge-triggered epoll event loop in C handling 10K+ conns
- Thread pool with mutex + condition variable work queue
- Lock-free MPSC queue using stdatomic.h CAS
- In-process LRU cache (hash map + doubly-linked list)
- Benchmark: thread-per-conn vs epoll event loop
| # | Concept | Tab | Key Insight |
|---|---|---|---|
| C1 | Threading models | Threading | Thread-per-req wastes 8KB stack × N; event loop reuses one thread |
| C2 | Synchronization primitives | Sync | Mutex = binary; semaphore = counting; condvar = signaling |
| C3 | Lock-free CAS / stdatomic.h | Lock-Free | Atomic read-modify-write without kernel involvement |
| C4 | I/O multiplexing evolution | I/O Mux | select→poll→epoll→io_uring; O(n)→O(1) notification |
| C5 | C10K problem | I/O Mux | OS scheduling kills thread-per-conn; epoll is the fix |
| C6 | In-process LRU/LFU caching | Caching | Hash map + doubly-linked list = O(1) get/put/evict |
| C7 | Cache stampede / Redis | Caching | Mutex lock or probabilistic early expiry prevents thundering herd |
| C8 | Connection pool management | Scaling | Pool exhaustion → queue with timeout + backpressure |
| C9 | Load balancing algorithms | Scaling | Consistent hashing minimizes redistribution on node change |
| C10 | Stateless horizontal scaling | Scaling | Externalize all state; make ops idempotent |
The threading model is the most consequential architectural decision in a concurrent server. It determines memory footprint, latency profile, CPU utilisation, and how hard the code is to reason about. There are four canonical models.
| Model | How It Works | Memory / 10K Conns | Best For | Real Examples |
|---|---|---|---|---|
| Thread-per-request | One OS thread per active connection; blocks on I/O | ~80 MB (8KB stack × 10K) | Low concurrency, simple CRUD | Apache prefork, early Java servlets |
| Thread pool | Fixed N threads; work queue; threads pick up tasks | ~800 KB (N=100 threads) | CPU-bound tasks, mixed workloads | Apache worker MPM, gRPC server |
| Event loop | Single thread; I/O multiplexing (epoll); non-blocking | ~kilobytes per conn | High-concurrency I/O-bound services | Nginx, Node.js, Redis |
| Green threads / M:N | Userspace scheduler maps M goroutines → N OS threads | ~2 KB stack (growable) | Mixed I/O+CPU; simple code | Go goroutines, Rust async/tokio, Erlang |
- Stack memory: Linux default 8 MB ulimit, minimum 4–8 KB actual; 10K threads = 40–80 MB minimum
- Context switching: Each OS thread switch costs ~1–10 µs (TLB flush, register save); 10K threads fighting for CPU = scheduling storm
- Blocking I/O: Thread sits idle during syscall (read, write, accept). CPU does nothing useful
- Memory bandwidth: Each sleeping thread's stack still occupies virtual memory — page faults on wake
- No blocking: All I/O is async; thread never waits
- One thread → thousands of connections: epoll delivers ready FDs in O(1)
- Cache-friendly: Single thread has hot L1/L2; no context switch overhead
- Predictable latency: No lock contention between connections
- Limitation: CPU-bound tasks block the loop — offload to thread pool
Best of both worlds: event loop accepts connections, but dispatches CPU-bound work to a thread pool. The key design decisions are queue depth (prevents unbounded memory growth), number of workers (usually CPU cores × 1–2), and what to do when the queue is full (reject with 503, or block with timeout).
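As a sketch of that hand-off (illustrative names only; this is not the module's ThreadPool code), a bounded queue protected by a mutex and condition variable might look like the following, with a full queue rejected immediately so the event loop can answer 503:

```c
#include <pthread.h>
#include <stdbool.h>

#define QUEUE_DEPTH 256                     /* bound: prevents unbounded memory growth */

typedef struct {
    void *items[QUEUE_DEPTH];
    int head, tail, count;
    pthread_mutex_t mu;
    pthread_cond_t not_empty;
} task_queue_t;

/* Event-loop side: enqueue CPU-bound work; reject (caller sends 503) if full. */
bool queue_push(task_queue_t *q, void *task) {
    pthread_mutex_lock(&q->mu);
    if (q->count == QUEUE_DEPTH) {          /* full: apply backpressure */
        pthread_mutex_unlock(&q->mu);
        return false;
    }
    q->items[q->tail] = task;
    q->tail = (q->tail + 1) % QUEUE_DEPTH;
    q->count++;
    pthread_cond_signal(&q->not_empty);     /* wake one worker */
    pthread_mutex_unlock(&q->mu);
    return true;
}

/* Worker side: block until a task is available. */
void *queue_pop(task_queue_t *q) {
    pthread_mutex_lock(&q->mu);
    while (q->count == 0)                   /* loop guards against spurious wakeups */
        pthread_cond_wait(&q->not_empty, &q->mu);
    void *task = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    pthread_mutex_unlock(&q->mu);
    return task;
}
```

Initialisation (PTHREAD_MUTEX_INITIALIZER, pthread_cond_init) and worker start-up are omitted; the predicate loop around pthread_cond_wait is the spurious-wakeup guard covered in the synchronization section below.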
Every synchronization primitive solves a specific problem. Using the wrong one either causes deadlock, starvation, or unnecessary serialization. Master this decision table before writing any concurrent code.
| Primitive | Semantics | pthreads API | Use When | Pitfall |
|---|---|---|---|---|
| Mutex | Binary lock — exclusive access to critical section | pthread_mutex_t | Protecting shared data structure (queue, hash map) | Lock ordering violations → deadlock |
| RW Lock | Multiple concurrent readers OR one exclusive writer | pthread_rwlock_t | Read-heavy workloads (config, caches, routing tables) | Writer starvation if readers never yield |
| Semaphore | Counting lock — allows N simultaneous holders | sem_t / sem_init | Rate limiting, connection pools, resource counting | Forgetting sem_post → deadlock; don't use as mutex |
| Condition Variable | Wait for a predicate; always paired with mutex | pthread_cond_t | Producer/consumer, work queues, thread pools | Spurious wakeups — always check predicate in loop |
| Spinlock | Busy-wait loop; no kernel involvement | pthread_spinlock_t | Very short critical sections on multi-core only | Wastes CPU if held for >~100 ns; never block inside |
Deadlock prevention: acquire locks in a fixed global order, or use pthread_mutex_trylock with backoff (if the second lock cannot be acquired, release the first, back off briefly, and retry).
Hold locks for the shortest time possible. Never do I/O, system calls, or expensive computation while holding a mutex. Extract the data you need under the lock, release, then process. This maximises throughput and minimises tail latency.
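A minimal sketch of that discipline, with illustrative names (conn_t, flush_response): the bytes are copied out under the mutex, and the blocking send() happens only after the lock is released.

```c
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>

/* Illustrative connection state protected by a mutex. */
typedef struct {
    pthread_mutex_t mu;
    char buf[4096];
    size_t len;
    int fd;
} conn_t;

void flush_response(conn_t *c) {
    char local[4096];
    size_t n;

    pthread_mutex_lock(&c->mu);
    n = c->len;
    memcpy(local, c->buf, n);          /* extract what is needed under the lock */
    c->len = 0;
    pthread_mutex_unlock(&c->mu);      /* release before any syscall */

    send(c->fd, local, n, 0);          /* blocking I/O outside the critical section */
}
```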
| Anti-Pattern | Fix |
|---|---|
| Call send() while holding mutex | Copy data out, release lock, then send |
| malloc inside critical section | Pre-allocate or allocate before locking |
| Nested locks without ordering | Define global lock hierarchy by module/enum |
| Spinlock on single-core machine | Use mutex — spinning just wastes quanta |
Lock-free algorithms allow multiple threads to make progress without mutual exclusion. The foundation is Compare-And-Swap (CAS) — an atomic operation that reads, compares, and conditionally writes in a single uninterruptible hardware instruction.
The operation atomically reads a memory location, compares it against an expected value, and writes a new value only if they match, reporting whether the swap happened.
Retry loops (optimistic concurrency) are used when CAS fails — re-read the current value and try again. Progress is guaranteed for at least one thread in each round.
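In C11 this is atomic_compare_exchange_weak/strong from stdatomic.h. A minimal sketch of the retry-loop pattern follows (a real counter would just use atomic_fetch_add; the explicit loop is shown because the same shape carries over to pointer swaps in lock-free stacks and queues):

```c
#include <stdatomic.h>

_Atomic long counter;

/* Lock-free increment built from a CAS retry loop (optimistic concurrency). */
void lockfree_increment(void) {
    long expected = atomic_load(&counter);
    /* compare_exchange writes expected + 1 only if counter still equals expected;
       on failure it reloads the current value into expected, and we retry. */
    while (!atomic_compare_exchange_weak(&counter, &expected, expected + 1)) {
        /* another thread won this round; retry with the fresh value */
    }
}
```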
CAS checks value equality, not identity. If a pointer changes A → B → A between your read and your CAS, the CAS succeeds even though the world has changed underneath you — leading to use-after-free or silent data corruption in lock-free linked lists and queues.
| Problem | Fix |
|---|---|
| Pointer A freed and reallocated at same address | Double-width CAS: combine pointer + version counter (128-bit CAS on x86-64) |
| Node popped from stack, pushed back | Tagged pointer — store generation counter in low bits (alignment gives free bits) |
| Safe memory reclamation | Hazard pointers: threads publish pointers they are reading; reclamation checks the hazard list before freeing |
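A rough sketch of the tagged-pointer row above, under the assumption that nodes are 64-byte aligned (aligned_alloc) so the low six pointer bits can carry a generation counter. The names (stack_head, stack_push, stack_pop) are illustrative, and safe memory reclamation remains a separate problem:

```c
#include <stdatomic.h>
#include <stdint.h>

#define TAG_MASK ((uintptr_t)0x3F)    /* 64-byte alignment leaves 6 spare low bits */

typedef struct node {                 /* allocate with aligned_alloc(64, sizeof(node_t)) */
    struct node *next;
    int value;
} node_t;

static node_t *untag(uintptr_t p)              { return (node_t *)(p & ~TAG_MASK); }
static uintptr_t tag(node_t *p, uintptr_t gen) { return (uintptr_t)p | (gen & TAG_MASK); }

static _Atomic uintptr_t stack_head;  /* tagged pointer: top of a Treiber stack */

void stack_push(node_t *n) {
    uintptr_t old = atomic_load(&stack_head);
    for (;;) {
        n->next = untag(old);
        uintptr_t desired = tag(n, (old & TAG_MASK) + 1);   /* every swap bumps the tag */
        if (atomic_compare_exchange_weak(&stack_head, &old, desired))
            return;
    }
}

node_t *stack_pop(void) {
    uintptr_t old = atomic_load(&stack_head);
    for (;;) {
        node_t *top = untag(old);
        if (!top)
            return NULL;
        /* An A->B->A sequence on the raw pointer now leaves a different tag,
           so a CAS against a stale snapshot correctly fails. */
        uintptr_t desired = tag(top->next, (old & TAG_MASK) + 1);
        if (atomic_compare_exchange_weak(&stack_head, &old, desired))
            return top;               /* old is refreshed on failure; the loop retries */
    }
}
```

The six-bit counter wraps quickly, and dereferencing top->next is only safe if popped nodes are not freed while another thread may still hold a reference, which is exactly the gap that hazard pointers or epoch-based reclamation close.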
Multi-producer single-consumer (MPSC) queues are the most practical lock-free structure for server internals. Producers CAS on the tail; the single consumer pops without a lock (no CAS is needed on the head because only one thread ever advances it).
Use memory_order_relaxed for statistics counters (no ordering needed). Use acquire/release pairs for producer-consumer handoffs. Use seq_cst only when you need a global total order — it is the slowest (full fence on x86, heavyweight on ARM).
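To make both points concrete, here is a minimal bounded MPSC ring sketch using stdatomic.h. It is a teaching sketch, not a production queue, and every name in it (mpsc_t, mpsc_push, mpsc_pop) is illustrative: producers claim a slot index with a relaxed CAS on the tail, write the payload, then flip a per-slot flag with a release store; the consumer reads the flag with acquire, so the payload write is guaranteed visible.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QSIZE 1024                            /* power of two, fixed capacity */

typedef struct {
    void *slot[QSIZE];
    _Atomic bool ready[QSIZE];                /* per-slot "published" flag */
    _Atomic size_t tail;                      /* claimed by producers via CAS */
    _Atomic size_t head;                      /* advanced only by the single consumer */
} mpsc_t;

/* Producer (many threads): claim a slot with a CAS retry loop, then publish. */
bool mpsc_push(mpsc_t *q, void *item) {
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    for (;;) {
        size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
        if (t - h >= QSIZE)
            return false;                     /* full; caller applies backpressure */
        if (atomic_compare_exchange_weak_explicit(&q->tail, &t, t + 1,
                memory_order_relaxed, memory_order_relaxed))
            break;                            /* slot t now belongs to this producer */
    }
    q->slot[t % QSIZE] = item;
    atomic_store_explicit(&q->ready[t % QSIZE], true,
                          memory_order_release);  /* pairs with the consumer's acquire */
    return true;
}

/* Consumer (exactly one thread): no CAS on the head, as noted above. */
void *mpsc_pop(mpsc_t *q) {
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    if (!atomic_load_explicit(&q->ready[h % QSIZE], memory_order_acquire))
        return NULL;                          /* nothing published at this slot yet */
    void *item = q->slot[h % QSIZE];          /* acquire makes the producer's write visible */
    atomic_store_explicit(&q->ready[h % QSIZE], false, memory_order_relaxed);
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return item;
}
```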
I/O multiplexing lets a single thread wait on multiple file descriptors simultaneously, being notified only when one is ready — eliminating the need for a thread per connection. The Linux kernel has evolved four generations of this interface over 30 years.
| API | Year | Notification Model | Max FDs | Complexity | Verdict |
|---|---|---|---|---|---|
| select | 1983 | Bitmask scan on return | 1024 (FD_SETSIZE) | O(n) per call | Legacy only |
| poll | 1986 | Linear scan of pollfd array | Unlimited | O(n) per call | Portable but slow |
| epoll | 2002 | Kernel event queue; ready list | Unlimited | O(1) add/wait | Linux standard |
| io_uring | 2019 | Ring buffers; zero syscall per I/O | Unlimited | O(1) + amortised | Linux 5.1+, future |
| Mode | When kernel notifies | Read behaviour |
|---|---|---|
| LT (default) | Every time FD is readable | Can read partial — notified again |
| ET (EPOLLET) | Only on state transition (new data arrives) | Must read until EAGAIN — otherwise miss data |
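A skeleton of the edge-triggered loop, assuming the listening socket and every accepted socket have already been put into non-blocking mode with fcntl(O_NONBLOCK); error handling is trimmed to the essentials:

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <errno.h>

#define MAX_EVENTS 1024

void event_loop(int epfd, int listen_fd) {
    struct epoll_event ev, events[MAX_EVENTS];

    ev.events = EPOLLIN | EPOLLET;                 /* edge-triggered */
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {                 /* ET: accept until EAGAIN */
                int c;
                while ((c = accept(listen_fd, NULL, NULL)) >= 0) {
                    /* set c non-blocking here (fcntl), then register it */
                    ev.events = EPOLLIN | EPOLLET;
                    ev.data.fd = c;
                    epoll_ctl(epfd, EPOLL_CTL_ADD, c, &ev);
                }
            } else {                               /* ET: read until EAGAIN */
                char buf[4096];
                ssize_t r;
                while ((r = read(fd, buf, sizeof buf)) > 0) {
                    /* handle r bytes (echo / parse) */
                }
                if (r == 0 || (r < 0 && errno != EAGAIN))
                    close(fd);                     /* peer closed or real error */
            }
        }
    }
}
```

With level-triggered mode the inner drain loops could be dropped, at the cost of extra wakeups per connection.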
Linux 5.1 (2019) introduced io_uring, built around two lock-free ring buffers shared between kernel and userspace:
| Feature | epoll | io_uring |
|---|---|---|
| Syscall per I/O | Yes (read/write + epoll_wait) | No (ring buffer, batched) |
| Zero-copy | No | Yes (fixed buffers) |
| Async file I/O | No (files always "ready") | Yes (true async) |
| Portability | Linux only | Linux 5.1+ only |
Dan Kegel's 1999 paper posed the question: how do you handle 10,000 simultaneous connections on a single server? At the time, the dominant model was one thread per connection. The math:
- 10K threads × 8 KB stack = 80 MB minimum (but pages are faulted in, TLB pressure)
- Context switch cost: ~1–10 µs each; 10K threads × 100 context switches/sec = 1–10 seconds of CPU time per second just switching
- Kernel scheduler: O(log n) for CFS scheduler; 10K sleeping threads compete for wake-up slots
Solution: epoll event loop. One thread. No context switches between connections. O(1) delivery of ready events. Nginx handles 50K–100K concurrent connections on a single core with this model.
Caching is the single most impactful performance optimisation available to a backend engineer. The key insight: most production workloads exhibit a power-law access pattern — 20% of keys account for 80% of requests. Keeping that hot 20% in memory eliminates most database round-trips.
The classic O(1) LRU implementation combines:
- Hash map: O(1) lookup — key → pointer to list node
- Doubly-linked list: O(1) insert/remove; MRU at head, LRU at tail
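A condensed sketch of that combination (not the module's full C implementation; the hash-map helpers map_get/map_put/map_del are assumed and not shown):

```c
#include <stdlib.h>

/* Doubly-linked list node: MRU at head, LRU at tail. */
typedef struct lru_node {
    int key, value;
    struct lru_node *prev, *next;
} lru_node_t;

typedef struct {
    lru_node_t *head, *tail;     /* list in recency order */
    int size, capacity;
    /* plus a hash map: key -> lru_node_t* */
} lru_cache_t;

/* Assumed O(1) hash-map helpers (implementation not shown in this sketch). */
lru_node_t *map_get(lru_cache_t *c, int key);
void        map_put(lru_cache_t *c, int key, lru_node_t *node);
void        map_del(lru_cache_t *c, int key);

static void list_unlink(lru_cache_t *c, lru_node_t *n) {
    if (n->prev) n->prev->next = n->next; else c->head = n->next;
    if (n->next) n->next->prev = n->prev; else c->tail = n->prev;
}

static void list_push_front(lru_cache_t *c, lru_node_t *n) {
    n->prev = NULL;
    n->next = c->head;
    if (c->head) c->head->prev = n; else c->tail = n;
    c->head = n;
}

/* O(1) get: hash lookup, then move the node to the front. */
int lru_get(lru_cache_t *c, int key, int *value_out) {
    lru_node_t *n = map_get(c, key);
    if (!n) return 0;                        /* miss */
    list_unlink(c, n);
    list_push_front(c, n);
    *value_out = n->value;
    return 1;
}

/* O(1) put: update in place, or insert and evict the tail when at capacity. */
void lru_put(lru_cache_t *c, int key, int value) {
    lru_node_t *n = map_get(c, key);
    if (n) { n->value = value; list_unlink(c, n); list_push_front(c, n); return; }
    if (c->size == c->capacity) {            /* evict least recently used */
        lru_node_t *lru = c->tail;
        list_unlink(c, lru);
        map_del(c, lru->key);
        free(lru);
        c->size--;
    }
    n = malloc(sizeof *n);
    n->key = key; n->value = value;
    list_push_front(c, n);
    map_put(c, key, n);
    c->size++;
}
```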
LFU evicts the key accessed the fewest times. Better than LRU for workloads where old-but-popular keys should stay (e.g. homepage, top products).
Implementation: min-heap keyed by frequency counter + hash map. On each access, increment freq and re-heapify; O(log n) per operation. Replacing the heap with frequency buckets (one doubly-linked list per frequency count) brings this down to O(1).
| Metric | Formula / Target |
|---|---|
| Hit rate | cache_hits / (hits + misses) → target >90% |
| Working set size | unique_hot_keys × avg_value_bytes |
| Memory budget | JVM/process heap × 30% (leave room for GC, buffers) |
| TTL vs eviction | TTL for correctness; capacity eviction for memory |
When a popular key's TTL expires simultaneously for N threads/processes, they all miss the cache and hammer the database together. The database sees a sudden N× spike in traffic. Three mitigations:
- Mutex lock on cache miss: First miss acquires lock and refreshes; other waiters read the freshly populated value. Problem: lock adds latency for all waiters.
- Probabilistic early expiry (XFetch): Randomly refresh before TTL expires — probability increases as expiry approaches. Zero extra infrastructure, no lock. Formula: if current_time - ttl * β * log(rand()) > expiry_time → refresh (a C sketch follows this list).
- Background refresh: Serve stale value immediately; async worker refreshes in background. Requires stale-while-revalidate semantics — acceptable when slight staleness is OK.
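A compact sketch of the XFetch check (cache_entry_t and xfetch_should_refresh are illustrative names). The published algorithm uses delta, the observed recomputation time, where the formula above substitutes the TTL; using delta makes early refreshes rarer:

```c
#include <math.h>      /* link with -lm */
#include <stdlib.h>
#include <stdbool.h>
#include <time.h>

/* Illustrative cache entry: absolute expiry time plus how long the last
   recomputation took (delta, in seconds). */
typedef struct {
    double expiry;
    double delta;
} cache_entry_t;

/* Returns true if this reader should refresh early. beta = 1.0 is the usual
   default; >1 refreshes earlier, <1 later. */
bool xfetch_should_refresh(const cache_entry_t *e, double beta) {
    double now = (double)time(NULL);
    double r = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* uniform in (0,1), never 0 */
    /* -delta * beta * log(r) is a positive random offset: the closer now is to
       expiry, the more likely the sum crosses it and triggers a refresh. */
    return now - e->delta * beta * log(r) >= e->expiry;
}
```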
A single Redis key receiving millions of reads/sec overloads one shard. Solutions:
- Local in-process replica: Read from Redis once, cache in process for 100 ms. Stale but radically cheaper.
- Key splitting: Store hotkey:0 through hotkey:N-1; a reader picks hotkey:{random 0..N-1}. Write to all N. Reads spread across N slots/shards.
- Read replicas: Route reads to Redis replicas, writes to primary. Redis Cluster supports this natively.
Scaling means keeping latency flat as traffic grows. The three levers are: pooling (amortise connection setup cost), load balancing (distribute traffic optimally), and stateless design (allow trivial horizontal replication).
Opening a TCP connection + TLS handshake + PostgreSQL auth takes 2–50 ms. A connection pool amortises this by keeping a set of pre-established connections ready to use. Key parameters:
| Parameter | Meaning | Typical Value |
|---|---|---|
| pool_size | Max simultaneous connections | 10–50 per service instance |
| min_idle | Keep-warm connections | 2–5 |
| acquire_timeout | Wait before returning error | 500 ms – 2 s |
| max_lifetime | Force recycle (leak prevention) | 30 min |
| idle_timeout | Close idle conns above min_idle | 5–10 min |
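A sketch of the acquire path with acquire_timeout, using pthread_cond_timedwait (pool_t and pool_acquire are illustrative names; connection creation, health checks, and release are omitted):

```c
#include <pthread.h>
#include <errno.h>
#include <time.h>
#include <stddef.h>

typedef struct {
    void **idle;                  /* stack of idle, pre-established connections */
    int n_idle;
    pthread_mutex_t mu;
    pthread_cond_t available;
} pool_t;

/* Block up to timeout_ms for an idle connection. NULL means the caller should
   fail fast (e.g. return 503) instead of queueing without bound: backpressure. */
void *pool_acquire(pool_t *p, int timeout_ms) {
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) { deadline.tv_sec++; deadline.tv_nsec -= 1000000000L; }

    pthread_mutex_lock(&p->mu);
    while (p->n_idle == 0) {
        if (pthread_cond_timedwait(&p->available, &p->mu, &deadline) == ETIMEDOUT) {
            pthread_mutex_unlock(&p->mu);
            return NULL;          /* acquire_timeout hit */
        }
    }
    void *conn = p->idle[--p->n_idle];
    pthread_mutex_unlock(&p->mu);
    return conn;
}
```

A matching pool_release would push the connection back onto the idle stack and pthread_cond_signal(&p->available).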
pgBouncer operates at the proxy level — thousands of app connections multiplex onto a smaller PostgreSQL connection pool (transaction-mode pooling). Reduces PostgreSQL's backend process count from thousands to tens.
| Algorithm | How | Best For | Avoid When |
|---|---|---|---|
| Round Robin | Request 1 → backend 1, request 2 → backend 2, … | Homogeneous backends, uniform request duration | Variable request times (some backends pile up) |
| Weighted Round Robin | Backend with weight 2 gets 2× requests | Heterogeneous hardware (some servers are more powerful) | Dynamic load changes |
| Least Connections | Always route to backend with fewest active connections | Variable request durations (DB queries, uploads) | Requires tracking active count — adds state to LB |
| IP Hash | hash(client_ip) mod N | Session stickiness (non-Redis session stores) | Horizontal scaling changes N → all sessions remapped |
| Consistent Hashing | Virtual nodes on a ring; hash(key) → nearest clockwise node | Caching layers (Redis cluster), CDN routing | Very small node counts (variance in distribution) |
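The core of the consistent-hashing lookup is small enough to sketch here; the hands-on exercise later in this module builds the full ring (add/remove, fnv1a hashing). The names below mirror that exercise, but only the lookup half is shown:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t hash;        /* position of a virtual node on the ring */
    int      node_id;     /* which physical node owns it */
} vnode_t;

typedef struct {
    vnode_t *vnodes;      /* kept sorted by hash */
    size_t   count;
} ring_t;

/* Find the owning node: the first virtual node clockwise of hash(key),
   wrapping to index 0 when the key hashes past the last entry. */
int ring_get_node(const ring_t *ring, uint32_t key_hash) {
    size_t lo = 0, hi = ring->count;      /* binary search: first vnode >= key_hash */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ring->vnodes[mid].hash < key_hash)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == ring->count)
        lo = 0;                           /* wrap around the ring */
    return ring->vnodes[lo].node_id;
}
```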
The golden rule: no instance-local state. If you can kill any instance and restart it on a different host without losing data, the service is stateless and horizontally scalable.
| State Type | Externalise To | Note |
|---|---|---|
| User sessions | Redis (with TTL) | Session ID in cookie; data in Redis |
| File uploads / blobs | S3 / object storage | Never write to local disk in stateless service |
| Rate limit counters | Redis INCR + EXPIRE | Sliding window or token bucket in Redis |
| Distributed locks | Redis SET NX EX / Redlock | For leader election or single-writer guarantees |
| Job queues | Redis Streams, SQS, Kafka | Workers are stateless consumers |
Idempotency keys make retries safe: the client sends an idempotency_key (UUID) that is stored in the DB with a unique constraint. On retry, the server detects the duplicate key and returns the original result without re-processing.
Compile everything with gcc -std=c11 -O2 -lpthread; the epoll server requires Linux ≥ 2.6.27. Build three versions of an echo server and compare throughput and latency under 10K concurrent connections using wrk or ab.
- server_threaded.c: accept loop spawns pthread_create for each connection. Use pthread_detach to avoid join overhead.
- server_pool.c: single accept thread enqueues connections to the ThreadPool from this module. Pool has POOL_SIZE=8 workers.
- server_epoll.c: single-threaded epoll event loop, edge-triggered, non-blocking. Handles reads in the same loop.
- Load-test each with wrk -t4 -c10000 -d30s http://localhost:8080/. Record req/sec, latency p50/p99, and CPU utilisation (pidstat).
- Note where the threaded server starts to fail (e.g. ENOMEM, or scheduler thrash visible in htop).

Measure the performance difference between a mutex-protected counter and an atomic counter under high contention.
- bench_counter.c: two versions of a global counter incremented 10M times across 16 threads.
- Version 1: pthread_mutex_t wrapping counter++.
- Version 2: atomic_fetch_add(&counter, 1, memory_order_relaxed).
- Build with gcc -O2 -std=c11 bench_counter.c -lpthread -o bench. Run 5× and average. Expected: atomic 3–10× faster.

Extend the LRU cache from the C Implementation tab to be thread-safe, then measure hit rate under a Zipf-distributed workload (80/20 rule).
- Add a pthread_rwlock_t to the LRUCache struct. Protect lru_get with rdlock and lru_put with wrlock.
- Generate keys with rand() % (n * zipf_skew) so that lower keys are accessed more often. Use 1000 unique keys.
- Profile with perf stat: compare cache-miss events at different capacities. Observe L1/L2 hit rates alongside LRU hit rates.

Implement a consistent hash ring in C and measure key redistribution when adding/removing nodes.
- Represent the ring as a sorted array of {hash_value, node_id} pairs. Use 150 virtual nodes per physical node (improves balance). Hash function: fnv1a(node_name + "_vnode_" + i).
- ring_add_node(ring, "node1") and ring_remove_node(ring, "node1") — insert/delete 150 entries, keeping the array sorted (insertion sort or qsort).
- ring_get_node(ring, key): hash the key, binary-search the sorted array for the nearest clockwise entry.

- Can explain thread-per-request memory/scheduling cost at 10K connections
- Can implement a bounded thread pool with mutex + condvar work queue in C
- Understand event loop model: single thread, non-blocking I/O, epoll notification
- Know when to use green threads (Go/Rust) vs explicit thread pools
- Can choose between mutex, RW lock, semaphore, condvar for a given use case
- Know the deadlock recipe and how to prevent it with lock ordering
- Always wrap pthread_cond_wait in a while loop (spurious wakeup guard)
- Understand spinlock trade-offs: only for sub-100ns critical sections on SMP
- Can explain CAS semantics: atomically read, compare, conditionally swap
- Know the ABA problem and two mitigations (tagged pointers, hazard pointers)
- Can use stdatomic.h: atomic_load/store/fetch_add/compare_exchange
- Understand memory ordering: relaxed vs acquire/release vs seq_cst
- Know select/poll/epoll/io_uring evolution and O(n)→O(1) improvement
- Can implement an epoll edge-triggered event loop in C from scratch
- Understand ET vs LT: ET requires reading until EAGAIN, LT re-notifies
- Know io_uring ring buffer design and why it reduces syscall overhead
- Can explain why 10K threads fail (stack, context switch cost)
- Can implement LRU cache with O(1) get/put using doubly-linked list + hash map
- Understand LFU vs LRU trade-offs for different access distributions
- Can explain cache stampede and implement mutex-lock or XFetch mitigation
- Know Redis hotspot key sharding strategies (key splitting, local replica)
- Know pgBouncer transaction-mode pooling and when pool exhaustion triggers
- Can compare round-robin, least-conn, and consistent hashing for a use case
- Understand consistent hashing: O(K/N) key redistribution vs O(K) for modulo
- Can design a stateless service: externalise sessions, use idempotency keys