M15 — Microservices & Infrastructure

Phase 6 · Service architecture decisions · API Gateway & service discovery · Circuit breaker & bulkhead · Docker multi-stage builds · Kubernetes fundamentals · CI/CD pipelines · 12-Factor App
🏗️ Monolith vs Microservices: The Real Decision
The default answer is start with a monolith — specifically a modular monolith with clean internal boundaries. Split only when you have a concrete reason to, not because microservices are trendy.

When microservices make sense:
  • Team topology: Conway's Law — your system architecture mirrors your communication structure. If you have 5 independent teams, a monolith creates coordination overhead; separate services let teams deploy independently.
  • Independent scaling: one component (e.g., image processing) needs 10× more resources than others — split it to scale independently
  • Technology heterogeneity: ML model serving needs Python, low-latency trading needs C — different services, different stacks
  • Fault isolation: a crash in recommendations shouldn't crash checkout
Microservices costs you must accept:
  • Network latency and reliability in every inter-service call
  • Distributed tracing, log aggregation, and health monitoring for N services
  • Data consistency without distributed transactions (Saga, Outbox)
  • Deployment pipeline for each service
Analogy — Bounded Contexts (DDD):
In an e-commerce domain, "Customer" means something different to the Billing context (credit card, payment history) vs the Shipping context (address, preferred carrier). Each bounded context defines its own model of "Customer" — and each maps to a microservice boundary. Crossing context boundaries requires an explicit translation (anti-corruption layer).
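To make the boundary concrete, here is a minimal C sketch of an anti-corruption layer (the struct layouts and field names are illustrative, not from any real codebase): each context owns its own Customer model, and only an explicit translation function crosses the boundary.

#include <stdio.h>
#include <string.h>

/* Billing context's model of a customer (illustrative fields) */
typedef struct {
    char   customer_id[37];
    char   card_token[64];
    double outstanding_balance;
} billing_customer_t;

/* Shipping context's model of a customer (illustrative fields) */
typedef struct {
    char customer_id[37];
    char street_address[128];
    char preferred_carrier[32];
} shipping_customer_t;

/* Anti-corruption layer: explicit translation at the context boundary.
 * Only the shared identifier crosses; each context fills the rest from
 * its own store, so Billing's fields never leak into Shipping's model. */
static void acl_billing_to_shipping(const billing_customer_t *in,
                                    shipping_customer_t *out) {
    memset(out, 0, sizeof(*out));
    snprintf(out->customer_id, sizeof(out->customer_id), "%s", in->customer_id);
}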
🌱 Modular Monolith First
Before splitting, enforce module boundaries inside the monolith:
  • Each module has a public API (headers/interfaces) — no reaching into internals
  • Modules do not share database tables across boundaries
  • Cross-module calls are synchronous function calls — trivially refactorable to HTTP/gRPC later
  • Modules can be extracted one at a time (Strangler Fig)
If your monolith has clean module boundaries, extracting a service is a lift-and-shift. If it's a big ball of mud, microservices just distribute the mess over a network.
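As a minimal sketch of what such a boundary can look like in a C modular monolith (file layout and names are hypothetical): the Orders module exposes one public header, and other modules may call only what it declares.

/* orders/orders.h — the Orders module's public API (hypothetical layout).
 * Other modules include only this header; orders_internal.h and the
 * module's own tables are off-limits to them. */
#ifndef ORDERS_H
#define ORDERS_H

#include <stdbool.h>

typedef struct {
    char   order_id[37];
    char   user_id[37];
    double total;
} order_summary_t;

/* Cross-module calls are plain function calls today... */
bool orders_place(const char *user_id, const char *cart_id, order_summary_t *out);
bool orders_get(const char *order_id, order_summary_t *out);

/* ...and because callers never touch internals or tables, these signatures
 * can later be backed by an HTTP/gRPC client instead of a local function. */

#endif /* ORDERS_H */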
🪴 Strangler Fig Pattern
Incrementally replace a monolith without a big-bang rewrite:
  1. Identify a bounded context to extract (e.g., Notifications)
  2. Build the new service alongside the monolith
  3. Route specific endpoints (/notify/*) through the API Gateway to the new service
  4. Monolith still handles everything else — both coexist
  5. Once new service is stable, remove the monolith's notification module
  6. Repeat for next module
📐 Phase 6 Module Map
Module | Topic | Key Concepts
M15 (this) | Microservices & Infrastructure | Architecture decisions, API Gateway, circuit breaker, Docker, K8s, CI/CD
M16 | Service Mesh & Advanced Infra | Istio/Envoy, mTLS, traffic shaping, Helm, Terraform IaC
Prerequisites: Ph3 (Auth — JWT validation at the gateway), Ph5 (Event-Driven — async inter-service communication, Outbox pattern)
🔗 Sync vs Async Inter-Service Communication
Dimension | Synchronous (REST/gRPC) | Asynchronous (Events/Queues)
Coupling | Temporal: caller blocks until callee responds | Loose: caller fires and continues
Latency | Fast for simple request/reply | Adds queuing delay (ms–seconds)
Failure propagation | Downstream failure cascades upstream | Broker buffers; caller unaffected if a consumer is down
Consistency | Immediate | Eventual
Observability | Easy: request trace follows the call chain | Harder: events fan out; needs correlation IDs
Best for | Queries, user-facing reads, RPC | Side effects (email, analytics, downstream processing)
Hybrid pattern: Use sync for the user-facing response (place order → return order ID immediately), then async for all side effects (charge payment, send confirmation, update analytics) via events.
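A sketch of that hybrid flow in C — insert_order() and publish_event() are placeholder names for the service's own DB write and event producer (e.g., a Kafka or outbox publisher), not a specific library API:

#include <stdio.h>

/* Placeholders for the service's own persistence and event publishing */
extern int insert_order(const char *user_id, const char *cart_id,
                        char *order_id_out, size_t out_len);      /* sync DB write */
extern int publish_event(const char *topic, const char *json_payload);

/* Sync path: validate + persist, return the order ID to the caller right away.
 * Async path: payment, confirmation email, analytics all react to the event. */
int place_order(const char *user_id, const char *cart_id,
                char *order_id_out, size_t out_len) {
    if (insert_order(user_id, cart_id, order_id_out, out_len) != 0)
        return -1;                      /* user sees an immediate error */

    char event[256];
    snprintf(event, sizeof(event),
             "{\"type\":\"order_placed\",\"order_id\":\"%s\",\"user_id\":\"%s\"}",
             order_id_out, user_id);
    publish_event("orders.events", event);  /* side effects happen asynchronously */
    return 0;                               /* respond with the order ID now */
}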
🌐 REST Design for Microservices
  • Versioned endpoints: /v1/orders — never break existing consumers
  • Idempotency keys on POST: Idempotency-Key: {uuid} header (see the sketch after this list)
  • Pagination: cursor-based over offset (stable under inserts)
  • Timeout headers: Request-Timeout: 5000 — avoid indefinite waits
  • Structured error responses: {"error":"NOT_FOUND","message":"..."}
  • Health endpoints: /health/live (process alive), /health/ready (dependencies healthy)
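A sketch of the idempotency-key item above — idem_lookup()/idem_store() and create_order() are illustrative stand-ins for a Redis- or Postgres-backed key store and the real handler:

#include <stdbool.h>
#include <stddef.h>

/* Illustrative store API: maps an idempotency key to a previously
 * computed response (Redis or Postgres in practice). */
extern bool idem_lookup(const char *key, char *cached_resp, size_t len);
extern void idem_store(const char *key, const char *resp);
extern int  create_order(const char *body, char *resp, size_t len);

/* POST /v1/orders with Idempotency-Key: retrying the same key returns the
 * original response instead of creating a duplicate order. */
int handle_create_order(const char *idem_key, const char *body,
                        char *resp, size_t resp_len) {
    if (idem_key && idem_lookup(idem_key, resp, resp_len))
        return 200;                     /* replay: same response, no new order */

    int status = create_order(body, resp, resp_len);
    if (idem_key && status < 500)
        idem_store(idem_key, resp);     /* remember the result for future retries */
    return status;
}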
⚡ gRPC for Internal Services
gRPC is preferred over REST for internal service-to-service calls:
  • Binary (Protobuf): smaller payload vs JSON, faster serialization
  • Typed contracts: .proto file is the source of truth — no schema drift
  • Streaming: server-side, client-side, and bidirectional streaming
  • HTTP/2: multiplexed connections, header compression
  • Code generation: auto-generated client/server stubs in any language
Use REST for external-facing APIs (browsers, third parties). Use gRPC for internal service mesh.
🚪 API Gateway Responsibilities
The API Gateway is the single entry point for all client traffic. It handles cross-cutting concerns so individual services don't have to:
Responsibility | How
Routing | Path-based: /orders/* → Order Service, /users/* → User Service
Auth offload | Validate JWT at the gateway; forward X-User-Id header to services — services trust the header
Rate limiting | Token bucket per client IP or API key; return 429 Too Many Requests
SSL termination | HTTPS at the gateway; plain HTTP on the internal network (mTLS for higher security)
Request aggregation (BFF) | Backend For Frontend: gateway calls 3 services and merges the response — saves the mobile client 3 round trips
Canary routing | Route 5% of traffic to a new service version by header/cookie — gradual rollout
Observability | Add X-Request-Id header; log request/response at the entry point
🔍 Service Discovery
Services are ephemeral — IPs change when containers restart. Service discovery provides stable addressing.
Client-side discovery (Consul/Eureka)
  • Service registers itself with registry on startup
  • Client queries registry → gets list of healthy instances → client-side load balances (round-robin, etc.)
  • More control but client must implement discovery logic
Server-side discovery (AWS ALB, Kubernetes)
  • Client sends request to load balancer
  • LB queries registry and forwards to healthy instance
  • Client is simple; LB handles all discovery
DNS-based (Kubernetes Services): Kubernetes injects a DNS name for every Service (orders.default.svc.cluster.local). kube-proxy maintains iptables rules that load-balance across healthy Pods. Client just talks to the DNS name — no discovery library needed.
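From the client's side, DNS-based discovery needs nothing beyond an ordinary getaddrinfo() lookup — a minimal sketch (the Service name shown is the usual Kubernetes form; the helper name is ours):

#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Resolve and connect to a Kubernetes Service by its DNS name.
 * kube-proxy / the ClusterIP does the load balancing behind that name. */
int connect_to_service(const char *host, const char *port) {
    struct addrinfo hints, *res, *p;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    int fd = -1;
    for (p = res; p; p = p->ai_next) {
        fd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (fd < 0) continue;
        if (connect(fd, p->ai_addr, p->ai_addrlen) == 0) break;
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;  /* -1 if no address worked */
}

/* Usage: int fd = connect_to_service("orders.default.svc.cluster.local", "80"); */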
🔀 Correlation ID: Tracing Async Request Chains
When a request fans out across services (sync) or events (async), a correlation ID ties all logs together:
/* At the API Gateway: generate a correlation ID if not present */
const char *corr_id = get_header(req, "X-Correlation-Id");
if (!corr_id) corr_id = generate_uuid();
set_header(req, "X-Correlation-Id", corr_id);

/* Each service: propagate to outgoing calls AND log it with every event */
log_info("correlation_id=%s action=order_placed order_id=%s", corr_id, order_id);

/* Each event published to Kafka: embed correlation_id in the message headers */
rd_kafka_header_add(headers, "correlation_id", -1, corr_id, strlen(corr_id));
When debugging a production issue, search all service logs by correlation ID to reconstruct the full request timeline across service boundaries and async event chains.
⚡ Why Resilience Patterns Are Necessary
In a microservices system, any service call can fail or slow down. Without resilience patterns, one slow service causes a cascade failure: upstream services pile up blocked threads waiting for the slow service → thread pool exhaustion → entire system down.

The four primary resilience patterns:
Pattern | Problem Solved
Timeout | Don't wait forever — bound the worst-case latency
Retry | Transient failures (network blips) often self-resolve — retry with backoff
Circuit Breaker | Stop calling a failing service — give it time to recover, fail fast to callers
Bulkhead | Isolate resource pools — one slow service can't exhaust all threads
🔌 Circuit Breaker — State Machine
Named after electrical circuit breakers that trip to prevent damage. Three states:
┌──────────────────────────────────────────────────────────────────┐
│ CLOSED (normal operation)
│ Requests pass through. Count consecutive failures.
│ failure_count >= threshold (e.g. 5 in 10s) → OPEN
└──────────────────────────────────────────────────────────────────┘
        │ failures exceed threshold
        ▼
┌──────────────────────────────────────────────────────────────────┐
│ OPEN (fail fast)
│ ALL requests rejected immediately (no call to downstream).
│ Returns cached/fallback response or error.
│ After timeout (e.g. 30s) → HALF-OPEN
└──────────────────────────────────────────────────────────────────┘
        │ recovery timeout elapsed
        ▼
┌──────────────────────────────────────────────────────────────────┐
│ HALF-OPEN (probe)
│ Allow N probe requests through.
│ All probes succeed → CLOSED
│ Any probe fails → OPEN (reset timer)
└──────────────────────────────────────────────────────────────────┘
🔑 Circuit Breaker: Key Configuration Parameters
Parameter | What It Controls | Guidance
failure_threshold | N failures to trip OPEN | 5–10 over a rolling window (not total)
failure_rate_threshold | % failure rate to trip (more robust than a count) | 50% failure rate over the last 20 requests
open_timeout | How long to stay OPEN before probing | 30s–60s, or exponential backoff
half_open_max_calls | Max probe calls in HALF-OPEN | 1–3 probes; don't flood the recovering service
slow_call_threshold | Calls slower than N ms count as failures | Set to 2× normal p99 latency
fallback | What to return in OPEN state | Cached response, degraded response, or structured error
Don't set timeouts too generously. If the breaker's slow_call_threshold is 30s but your HTTP client timeout is 60s, threads still block up to 60s on a hung call before it can even be counted as slow. Always set the HTTP timeout ≤ the circuit breaker's slow_call_threshold.
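As a small illustration, assuming libcurl as the HTTP client (the constant name is ours): bound every request by the breaker's slow-call threshold, so a hung downstream can never hold a thread longer than the breaker's own definition of "too slow".

#include <curl/curl.h>

#define SLOW_CALL_THRESHOLD_MS 2000L   /* must match the breaker's configuration */

/* Per-request timeouts bounded by the breaker's slow-call threshold */
static void configure_timeouts(CURL *curl) {
    curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT_MS, 500L);
    curl_easy_setopt(curl, CURLOPT_TIMEOUT_MS, SLOW_CALL_THRESHOLD_MS);
    /* A CURLE_OPERATION_TIMEDOUT result is then recorded as a breaker failure. */
}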
🚢 Bulkhead Pattern
Named after the watertight compartments in a ship — if one compartment floods, others are sealed off and the ship survives.

In microservices: instead of one shared thread pool for all downstream calls, create separate thread pools per dependency:
/* Thread pool bulkhead: separate pool per downstream service */
typedef struct {
    pthread_t    threads[POOL_SIZE];
    work_queue_t queue;
    const char  *name;        /* e.g. "payment-service" */
    int          max_queue;   /* reject if queue full */
} bulkhead_pool_t;

/* Separate pools: payment can be slow without blocking inventory calls */
bulkhead_pool_t payment_pool   = { .name = "payment",   .max_queue = 50  };
bulkhead_pool_t inventory_pool = { .name = "inventory", .max_queue = 200 };
If Payment Service slows down and fills the payment pool queue, the system returns 503 Service Unavailable for payment calls only. Inventory calls proceed normally — the bulkhead contains the failure.
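A sketch of the submission path for those pools — it reuses bulkhead_pool_t from above, while work_item_t and the work_queue_* helpers are placeholders for whatever queue API the service uses:

/* Reject instead of blocking when a dependency's queue is full. */
typedef struct work_item work_item_t;                 /* placeholder job type */
extern int  work_queue_len(work_queue_t *q);          /* placeholder queue API */
extern void work_queue_push(work_queue_t *q, work_item_t *item);

typedef enum { BH_ACCEPTED = 0, BH_REJECTED = -1 } bh_result_t;

bh_result_t bulkhead_submit(bulkhead_pool_t *pool, work_item_t *item) {
    if (work_queue_len(&pool->queue) >= pool->max_queue) {
        /* This dependency is saturated: caller maps BH_REJECTED to 503
         * for this dependency only; other pools keep serving normally. */
        return BH_REJECTED;
    }
    work_queue_push(&pool->queue, item);  /* a pool thread will pick it up */
    return BH_ACCEPTED;
}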
🔁 Retry with Exponential Backoff + Jitter
Retries are essential for transient failures, but naive retries can cause thundering herd — hundreds of services all retry at the same second and overwhelm the recovering service.

Solution: exponential backoff + random jitter:
#include <stdlib.h>   /* rand */
#include <unistd.h>   /* usleep */

/* Retry with exponential backoff and full jitter */
int retry_with_backoff(int (*fn)(void *), void *ctx, int max_attempts, int base_ms) {
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (fn(ctx) == 0) return 0;            /* success */
        if (attempt + 1 == max_attempts) break;

        /* Exponential: base_ms * 2^attempt, capped at 30s */
        int cap = base_ms * (1 << attempt);
        if (cap > 30000) cap = 30000;

        /* Full jitter: random in [0, cap] — spreads retries */
        int delay = rand() % (cap + 1);
        usleep(delay * 1000);
    }
    return -1;  /* all attempts failed */
}

/* Usage: retry up to 5 times, starting at 100ms base delay */
retry_with_backoff(call_payment_service, &ctx, 5, 100);
Only retry idempotent operations. Never blindly retry a POST that creates a resource — you'll create duplicates. Use idempotency keys (M13) to make POSTs safe to retry.
🐳 Multi-Stage Docker Build for C/C++
A C binary compiled in a full build image can run in a minimal runtime image. Multi-stage builds separate compilation from runtime, dramatically reducing image size — a full gcc build image weighs in at over 1 GB, while a slim runtime image carrying only the needed shared libraries is typically an order of magnitude smaller:
📄 Dockerfile — Multi-Stage C/C++ Build
# Stage 1: Builder — compile the binary
FROM gcc:13-bookworm AS builder
WORKDIR /build

# Install only build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
        librdkafka-dev \
        libssl-dev \
        libpq-dev \
        cmake \
    && rm -rf /var/lib/apt/lists/*

# Copy source
COPY . .

# Compile — statically link where possible for a portable binary
RUN cmake -DCMAKE_BUILD_TYPE=Release -B build . \
    && cmake --build build --target order_service -j$(nproc)

############################################################
# Stage 2: Runtime — minimal image, just the binary
FROM debian:bookworm-slim

# Install only runtime libraries (no compilers, headers, or build tools)
RUN apt-get update && apt-get install -y --no-install-recommends \
        librdkafka1 \
        libssl3 \
        libpq5 \
        ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Security: run as non-root
RUN useradd -r -s /bin/false appuser
USER appuser
WORKDIR /app

# Copy only the compiled binary from the builder stage
COPY --from=builder /build/build/order_service /app/order_service

# ENTRYPOINT: exec form — PID 1 gets signals properly (SIGTERM for graceful shutdown)
ENTRYPOINT ["/app/order_service"]

# Default arguments (overridable at runtime)
CMD ["--port=8080"]
✅ Docker Best Practices
  • Pin image versions: debian:bookworm-slim not debian:latest — reproducible builds
  • Non-root user: useradd -r + USER appuser — container escape with root = host root
  • No secrets in image: use environment variables or secrets mounts, never ARG PASSWORD (visible in layers)
  • COPY specific files: COPY src/ /build/src/ not COPY . . — avoids copying .git, local configs
  • Read-only root filesystem: --read-only flag — forces explicit volume mounts for writable paths
  • Health check: HEALTHCHECK CMD curl -f http://localhost:8080/health/live || exit 1
🚫 Common Docker Mistakes
  • Running as root (default if no USER set)
  • Using latest tag — non-deterministic; breaks reproducibility
  • Building in a single stage — final image carries GCC, headers, build tools
  • Putting secrets in environment variables that get logged
  • Using shell form instead of exec form — /bin/sh becomes PID 1, so docker stop's SIGTERM never reaches your process
  • apt-get update without && apt-get install in same RUN — stale layer cache
  • Not adding a .dockerignore — copies node_modules/, .git/, build artifacts
📄 .dockerignore
# .dockerignore — keep build context small and clean
.git
.gitignore
.github
build/
*.o
*.a
*.so
cmake-build-debug/
CMakeCache.txt
CMakeFiles/
.env
.env.*
*.md
docs/
tests/
Every byte in the build context is sent to the Docker daemon. Large build contexts (accidental .git inclusion) slow down every build. A good .dockerignore is as important as the Dockerfile itself.
🔄 ENTRYPOINT vs CMD: Graceful Shutdown
When Kubernetes sends SIGTERM (graceful shutdown), the signal goes to PID 1 in the container. If your process is not PID 1, it never receives SIGTERM and gets hard-killed after terminationGracePeriodSeconds.

Form | Shell wrapper | PID 1 | Gets SIGTERM?
ENTRYPOINT ["/app/service"] (exec) | No | Your binary | ✅ Yes
ENTRYPOINT /app/service (shell) | /bin/sh -c | sh | ❌ No (sh is PID 1)
CMD ["/app/service"] (exec) | No | Your binary | ✅ Yes (if no ENTRYPOINT)
Always use exec form: ENTRYPOINT ["/app/service"]. In your C process, register a SIGTERM handler that drains connections and exits cleanly.
☸️ Kubernetes Core Objects
Object | Purpose | Key Fields
Pod | Smallest deployable unit: one or more containers sharing network/storage | spec.containers[].image, resources, env
Deployment | Declares desired state: N replicas of a Pod template; manages rolling updates and rollback | spec.replicas, spec.strategy, spec.template
Service | Stable DNS name + ClusterIP that load-balances across matching Pods (by label selector) | spec.selector, spec.ports, spec.type
Ingress | HTTP/S routing rules: hostname/path → Service; TLS termination | spec.rules[].host, spec.tls
ConfigMap | Non-sensitive configuration: mounted as env vars or files | data key-value pairs
Secret | Sensitive data (passwords, tokens): base64-encoded; encrypted at rest only if etcd encryption is enabled | data (base64), type
HPA | Horizontal Pod Autoscaler: scales replicas based on CPU/memory/custom metrics | spec.minReplicas, spec.maxReplicas, spec.metrics
📄 Kubernetes Deployment — C Service Example
# order-service Deployment + Service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  labels:
    app: order-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most 1 Pod down during update
      maxSurge: 1         # at most 1 extra Pod during update
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: registry.example.com/order-service:1.4.2
          ports:
            - containerPort: 8080
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: order-service-secrets
                  key: database_url
            - name: KAFKA_BROKERS
              valueFrom:
                configMapKeyRef:
                  name: order-service-config
                  key: kafka_brokers
          resources:
            requests:
              cpu: "100m"      # 0.1 CPU cores guaranteed
              memory: "64Mi"
            limits:
              cpu: "500m"      # burst up to 0.5 CPU
              memory: "256Mi"  # OOM-killed if exceeded
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3   # restart after 3 consecutive failures
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 3
            periodSeconds: 5
            failureThreshold: 2   # remove from LB after 2 failures
      terminationGracePeriodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: order-service
spec:
  selector:
    app: order-service
  ports:
    - port: 80
      targetPort: 8080
💓 Liveness vs Readiness Probes
Probe | Failure Action | Purpose
Liveness | Restart the container | Is the process alive? (Detects deadlocks, infinite loops)
Readiness | Remove from Service endpoints (stops traffic) | Is the process ready to serve? (DB connected, cache warm)
Startup | Restart if not started within the window | Slow-starting apps — liveness is disabled until the startup probe succeeds
Never make readiness probe depend on external services. If Payment Service is down, you don't want all Order Service pods removed from load balancing — implement graceful degradation instead.
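A sketch of a readiness check that follows this rule — PQstatus() is the real libpq call; the payment_degraded flag is an illustrative hook fed by the circuit breaker:

#include <libpq-fe.h>
#include <stdbool.h>

extern PGconn *pg_conn;                 /* this service's own database handle */
extern volatile int payment_degraded;   /* illustrative flag set by the payment circuit breaker */

/* Readiness reflects only what THIS pod needs in order to serve at all. */
bool is_ready(void) {
    return pg_conn != NULL && PQstatus(pg_conn) == CONNECTION_OK;
}

/* Downstream trouble (e.g. Payment Service down) degrades responses instead:
 * reporting NotReady here would pull every Order pod out of the load balancer. */
bool is_degraded(void) {
    return payment_degraded;
}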
🚀 Rolling Updates & Rollback
  • Update image: kubectl set image deployment/order-service order-service=registry.example.com/order-service:1.4.3
  • Monitor rollout: kubectl rollout status deployment/order-service
  • Rollback to previous: kubectl rollout undo deployment/order-service
  • Rollback to specific revision: kubectl rollout undo deployment/order-service --to-revision=2
Pod Disruption Budget: minAvailable: 2 — node drains and cluster-autoscaler evictions respect this; voluntary disruptions never take down so many pods that fewer than 2 remain available.
🔄 CI/CD Pipeline Stages
A complete pipeline runs on every commit and gates production deployment behind automated quality checks:
  1. Lint & Static Analysis — clang-tidy, cppcheck, clang-format check. Fail fast: bad code never reaches tests. (~30s)
  2. Unit Tests — fast, isolated tests with mocked dependencies. Target: >80% coverage on core business logic. (~2m)
  3. Integration Tests — spin up Postgres, Kafka, Redis via Docker Compose; test real service behavior against real dependencies. (~5m)
  4. Security Scan — Trivy scans for CVEs in the base image and dependencies; Semgrep for security antipatterns in code. Block on HIGH/CRITICAL CVEs.
  5. Build OCI Image — multi-stage Docker build. Tag with the git SHA: registry/service:abc1234. SHA tags are immutable — never use :latest in production.
  6. Push to Registry — push to the container registry. Sign the image with cosign for supply chain security.
  7. Deploy to Staging — kubectl set image or Helm upgrade. Run smoke tests against the staging URL.
  8. Deploy to Production — manual approval gate (or auto on green staging). Blue-green or canary rollout. Monitor error rate + latency for 10 minutes.
🟢🔵 Blue-Green Deployment
Maintain two identical environments (blue = current, green = new):
  1. Deploy new version to green environment
  2. Run smoke tests on green (not receiving production traffic)
  3. Switch load balancer to point to green (instant cutover)
  4. Blue environment kept running for instant rollback
  5. After confidence period, decommission blue
Pros: zero-downtime, instant rollback
Cons: requires 2× infrastructure during transition
🐦 Canary Deployment
Route a small percentage of traffic to new version first:
  1. Deploy new version alongside old; route 5% of traffic to it
  2. Monitor error rate, latency, business metrics (conversion rate)
  3. If healthy after 10m: increase to 20% → 50% → 100%
  4. If issues: instant rollback by routing 100% back to old version
Pros: real production traffic validation, minimal blast radius
Cons: two versions run simultaneously — must be API-compatible
📄 GitHub Actions Example — C Service CI Pipeline
# .github/workflows/ci.yml
name: CI
on:
  push:
    branches: [main]
  pull_request:

jobs:
  build-test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: test
          POSTGRES_DB: testdb
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y libpq-dev librdkafka-dev clang-tidy cppcheck
      - name: Configure
        run: cmake -B build -DCMAKE_BUILD_TYPE=Debug -DENABLE_TESTS=ON
      - name: Build
        run: cmake --build build -j$(nproc)
      - name: Lint
        run: clang-tidy src/*.c -- -Iinclude
      - name: Unit tests
        run: ./build/tests/unit_tests
      - name: Integration tests
        env:
          DATABASE_URL: postgres://postgres:test@localhost/testdb
        run: ./build/tests/integration_tests

  docker-build:
    needs: build-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image
        run: |
          docker build -t registry.example.com/order-service:${{ github.sha }} .
          docker push registry.example.com/order-service:${{ github.sha }}
      - name: Deploy to staging
        run: |
          kubectl set image deployment/order-service \
            order-service=registry.example.com/order-service:${{ github.sha }}
          kubectl rollout status deployment/order-service --timeout=5m
📋 The 12-Factor App Methodology
A methodology for building software-as-a-service apps that are portable, scalable, and maintainable. Originally from Heroku — now the standard for cloud-native services.
🔑 The 12 Factors (Microservices-Relevant Highlights)
# | Factor | Rule | C Implementation
1 | Codebase | One codebase per service, tracked in version control | One git repo per service; main deploys to production
2 | Dependencies | Declare and isolate all dependencies explicitly | CMakeLists.txt pins exact library versions; no implicit system libraries
3 | Config | Config in the environment, not in code | getenv("DATABASE_URL") — never hardcode DSNs/passwords
4 | Backing Services | Treat DB, cache, broker as attached resources | URL from env — swap Postgres for RDS without code changes
6 | Processes | Execute the app as one or more stateless processes | No in-process session state; sessions in Redis
7 | Port Binding | Export services via port binding, not app-server injection | Service binds $PORT itself; Kubernetes routes to it
8 | Concurrency | Scale out via the process model | Multiple replicas (K8s replicas: N), not threads-per-monolith
9 | Disposability | Maximize robustness with fast startup and graceful shutdown | Handle SIGTERM: drain connections, flush buffers, exit 0
11 | Logs | Treat logs as event streams — write to stdout | fprintf(stdout, "...") — never write to files inside the container
12 | Admin Processes | Run admin/management tasks as one-off processes | DB migrations as a separate Kubernetes Job, not in service startup
⚙️ Factor 3: Config via Environment
/* Never do this: */
const char *db_url = "postgres://prod-db:5432/app";

/* Do this: */
const char *db_url = getenv("DATABASE_URL");
if (!db_url) {
    fprintf(stderr, "DATABASE_URL not set\n");
    exit(1);
}
Config that varies between environments (dev/staging/prod) must never be in code. The same binary runs in all environments — only the environment variables differ.
🔄 Factor 9: Graceful Shutdown in C
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t shutting_down = 0;

static void handle_sigterm(int sig) {
    (void)sig;
    shutting_down = 1;
}

int main(void) {
    signal(SIGTERM, handle_sigterm);
    signal(SIGINT, handle_sigterm);

    while (!shutting_down) {
        /* serve requests */
    }

    /* Graceful shutdown: drain connections, flush, close.
     * rk (Kafka producer) and pg_conn (Postgres connection) were created at startup. */
    drain_active_connections();
    rd_kafka_flush(rk, 10000);   /* flush pending events */
    PQfinish(pg_conn);
    fprintf(stdout, "Shutdown complete\n");
    return 0;
}
📊 Factor 11: Structured Logs to stdout
Write logs as structured JSON to stdout. The container runtime captures stdout and forwards to your log aggregation platform (ELK, Loki, Datadog).
/* Structured JSON logging */
#define LOG_INFO(fmt, ...) \
    fprintf(stdout, \
            "{\"level\":\"INFO\",\"ts\":\"%.3f\",\"msg\":\"" fmt "\"}\n", \
            get_unix_ms(), ##__VA_ARGS__)

/* Usage */
LOG_INFO("order_placed order_id=%s user_id=%s amount=%.2f",
         order_id, user_id, amount);
Include in every log line: timestamp, level, service, correlation_id, and the event. This makes logs searchable and correlatable across services in your log aggregator.
── Implementation 1 — Circuit Breaker (Thread-Safe, C11 Atomics) ──
🔌 Circuit Breaker in C (stdatomic, three-state machine)
/* circuit_breaker.h — thread-safe circuit breaker */
#pragma once
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

typedef enum { CB_CLOSED, CB_OPEN, CB_HALF_OPEN } cb_state_t;

typedef struct {
    _Atomic(int)  state;          /* cb_state_t */
    _Atomic(int)  failure_count;
    _Atomic(long) open_since_ms;  /* epoch ms when opened */
    int  failure_threshold;
    long open_timeout_ms;
} circuit_breaker_t;

static inline long now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000LL + ts.tv_nsec / 1000000LL;
}

static inline void cb_init(circuit_breaker_t *cb, int threshold, long timeout_ms) {
    atomic_store(&cb->state, CB_CLOSED);
    atomic_store(&cb->failure_count, 0);
    atomic_store(&cb->open_since_ms, 0);
    cb->failure_threshold = threshold;
    cb->open_timeout_ms = timeout_ms;
}

/* Returns true if the call should be allowed through */
static inline bool cb_allow(circuit_breaker_t *cb) {
    int state = atomic_load(&cb->state);
    if (state == CB_CLOSED) return true;

    if (state == CB_OPEN) {
        long elapsed = now_ms() - atomic_load(&cb->open_since_ms);
        if (elapsed >= cb->open_timeout_ms) {
            /* Transition to HALF_OPEN to probe recovery */
            int expected = CB_OPEN;
            if (atomic_compare_exchange_strong(&cb->state, &expected, CB_HALF_OPEN)) {
                return true;  /* this thread gets the probe request */
            }
        }
        return false;  /* still open */
    }

    /* HALF_OPEN: let probes through (a production version would cap concurrent probes) */
    return true;
}

static inline void cb_on_success(circuit_breaker_t *cb) {
    int state = atomic_load(&cb->state);
    if (state == CB_HALF_OPEN) {
        /* Recovery confirmed: close the breaker */
        atomic_store(&cb->failure_count, 0);
        atomic_store(&cb->state, CB_CLOSED);
    }
    if (state == CB_CLOSED) {
        /* Reset failure count on success */
        atomic_store(&cb->failure_count, 0);
    }
}

static inline void cb_on_failure(circuit_breaker_t *cb) {
    int state = atomic_load(&cb->state);
    if (state == CB_HALF_OPEN) {
        /* Probe failed: reopen the breaker */
        atomic_store(&cb->open_since_ms, now_ms());
        atomic_store(&cb->state, CB_OPEN);
        return;
    }
    int count = atomic_fetch_add(&cb->failure_count, 1) + 1;
    if (count >= cb->failure_threshold) {
        int expected = CB_CLOSED;
        if (atomic_compare_exchange_strong(&cb->state, &expected, CB_OPEN)) {
            atomic_store(&cb->open_since_ms, now_ms());
            fprintf(stderr, "[CB] Circuit OPENED after %d failures\n", count);
        }
    }
}

/* Usage */
int call_payment_service(void *ctx) { (void)ctx; return 0; }  /* placeholder */
circuit_breaker_t payment_cb;

int charge_customer(const char *order_id, double amount) {
    (void)order_id; (void)amount;
    if (!cb_allow(&payment_cb)) {
        fprintf(stderr, "[CB] OPEN: payment service unavailable\n");
        return -1;  /* fail fast */
    }
    int result = call_payment_service(NULL);
    if (result == 0) cb_on_success(&payment_cb);
    else             cb_on_failure(&payment_cb);
    return result;
}
── Implementation 2 — Health Check HTTP Server ──
💓 Minimal Health Check HTTP Server (POSIX sockets)
/* health.c — minimal HTTP health check endpoint for Kubernetes probes */
#include <sys/socket.h>
#include <netinet/in.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>

static volatile int ready = 0;  /* set to 1 once DB connected etc. */

static void *health_thread(void *arg) {
    (void)arg;
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    int opt = 1;
    setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_addr.s_addr = INADDR_ANY,
        .sin_port = htons(8081)
    };
    bind(srv, (struct sockaddr *)&addr, sizeof(addr));
    listen(srv, 16);

    while (1) {
        int conn = accept(srv, NULL, NULL);
        if (conn < 0) continue;

        char buf[256];
        ssize_t n = recv(conn, buf, sizeof(buf) - 1, 0);
        buf[n > 0 ? n : 0] = '\0';

        const char *resp;
        if (strstr(buf, "GET /health/live")) {
            resp = "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nOK";
        } else if (strstr(buf, "GET /health/ready")) {
            resp = ready
                ? "HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nREADY"
                : "HTTP/1.1 503 Service Unavailable\r\nContent-Length: 13\r\n\r\nNOT_READY_YET";
        } else {
            resp = "HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\n\r\n";
        }
        send(conn, resp, strlen(resp), 0);
        close(conn);
    }
    return NULL;
}

void start_health_server(void) {
    pthread_t t;
    pthread_create(&t, NULL, health_thread, NULL);
    pthread_detach(t);
}

void set_ready(int r) { ready = r; }
Start the health server before connecting to databases so the /health/live probe succeeds immediately. Set ready=1 only after all dependencies (DB, Kafka) are connected — this keeps the pod out of the Service load balancer until it's actually ready.
🔬 Lab 1 — Circuit Breaker Under Load
Observe circuit breaker state transitions under real failure conditions.
1 Build the circuit breaker from Implementation 1. Write a test harness that calls a "downstream service" function that returns success/failure based on a configurable failure rate.
2 Run 100 concurrent threads (pthreads) calling the circuit breaker simultaneously. Set the failure rate to 80%.
3 Observe and log state transitions: CLOSED → OPEN (trip after N failures) → HALF-OPEN (after timeout) → CLOSED (after probe success).
4 Measure: in OPEN state, what is the p99 response time? (Should be microseconds — fail fast.) Compare to CLOSED state with real downstream calls.
5 Bonus: add a sliding window failure rate threshold (failure rate over last 20 calls, not just a count) and verify it's more resilient to bursty failures.
🔬 Lab 2 — Docker Multi-Stage Build: Size Comparison
Demonstrate the image size impact of multi-stage builds.
1 Write a simple C HTTP server (or use the health server from Implementation 2). Compile it manually to confirm it works.
2 Write a single-stage Dockerfile using FROM gcc:13. Build it with docker build -t service:single-stage . and check the size with docker image ls service:single-stage.
3 Write the multi-stage Dockerfile from the Docker section above. Build it with docker build -t service:multi-stage . and compare sizes.
4 Run docker run --rm service:multi-stage. Verify the binary executes correctly in the slim image.
5 Run docker history service:multi-stage — verify no build tools (gcc, make) appear in any layer of the final image.
6 Run docker run --user $(id -u) service:multi-stage — verify non-root execution. Check process inside container: docker exec <id> id.
🔬 Lab 3 — Kubernetes Deployment with Probes
Deploy the C service to a local Kubernetes cluster (minikube or kind).
1 Load the Docker image into the cluster: minikube image load service:multi-stage.
2 Apply the Deployment manifest from the Kubernetes section above. Watch pods come up: kubectl get pods -w.
3 Observe readiness probe in action: modify the service to delay setting ready=1 by 10 seconds. Watch the pod stay NotReady during startup.
4 Trigger a liveness probe failure: modify the /health/live endpoint to return 503 after receiving 5 requests. Observe Kubernetes restart the pod.
5 Perform a rolling update: rebuild with a different version tag, apply the new image. Watch rolling update: kubectl rollout status deployment/order-service.
6 Rollback: kubectl rollout undo deployment/order-service. Verify the previous image is running.
🔬 Lab 4 — Strangler Fig Migration (Simulated)
Simulate extracting a service from a monolith using Strangler Fig.
1 Write a "monolith": a C HTTP server handling /orders/*, /users/*, and /notifications/* all in one process.
2 Write a new "Notifications microservice": a separate C process handling /notifications/*.
3 Add an Nginx reverse proxy as the "API Gateway": route /notifications/* to the new service, all other paths to the monolith.
4 Verify: requests to /orders/123 hit the monolith. Requests to /notifications/send hit the new service. Both return correct responses.
5 Remove the notifications handler from the monolith. Verify all notification requests still work (now served entirely by new service).
── Phase 6 Mastery Checklist ──
Architecture
  • Explain Conway's Law and how it drives service boundaries
  • Describe the Strangler Fig migration pattern step by step
  • Compare sync REST/gRPC vs async events for inter-service communication
  • List 5 responsibilities of an API Gateway
  • Explain client-side vs server-side service discovery
Resilience
  • Draw the circuit breaker state machine (CLOSED/OPEN/HALF-OPEN)
  • Implement retry with exponential backoff + jitter
  • Explain the bulkhead pattern and when to apply it
  • Set correct timeout values relative to circuit breaker thresholds
Docker & Kubernetes
  • Write a multi-stage Dockerfile for a C binary
  • Explain why non-root + exec-form ENTRYPOINT matters
  • Write a Deployment with liveness + readiness probes
  • Explain the difference between liveness and readiness probes
  • Perform a rolling update and rollback with kubectl
CI/CD & 12-Factor
  • List the 8 stages of a production CI/CD pipeline
  • Explain blue-green vs canary deployment trade-offs
  • Apply 12-Factor principles: config in env, logs to stdout, graceful shutdown
  • Write structured JSON logging and graceful SIGTERM handling in C