M15 — Microservices & Infrastructure

Phase 6 · Service architecture decisions · API Gateway & service discovery · Circuit breaker & bulkhead · Docker multi-stage builds · Kubernetes fundamentals · CI/CD pipelines · 12-Factor App
🏗️ Monolith vs Microservices: The Real Decision
The default answer is start with a monolith — specifically a modular monolith with clean internal boundaries. Split only when you have a concrete reason to, not because microservices are trendy.

When microservices make sense:
  • Team topology: Conway's Law — your system architecture mirrors your communication structure. If you have 5 independent teams, a monolith creates coordination overhead; separate services let teams deploy independently.
  • Independent scaling: one component (e.g., image processing) needs 10× more resources than others — split it to scale independently
  • Technology heterogeneity: ML model serving needs Python, low-latency trading needs C — different services, different stacks
  • Fault isolation: a crash in recommendations shouldn't crash checkout
Microservices costs you must accept:
  • Network latency and reliability in every inter-service call
  • Distributed tracing, log aggregation, and health monitoring for N services
  • Data consistency without distributed transactions (Saga, Outbox)
  • Deployment pipeline for each service
Analogy — Bounded Contexts (DDD):
In an e-commerce domain, "Customer" means something different to the Billing context (credit card, payment history) vs the Shipping context (address, preferred carrier). Each bounded context defines its own model of "Customer" — and each maps to a microservice boundary. Crossing context boundaries requires an explicit translation (anti-corruption layer).
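To make the boundary concrete, here is a minimal C sketch of an anti-corruption layer (the struct layouts and field names are illustrative, not from any real codebase): each context owns its own Customer model, and only an explicit translation function crosses the boundary.

#include <stdio.h>
#include <string.h>

/* Billing context's model of a customer (illustrative fields) */
typedef struct {
    char   customer_id[37];
    char   card_token[64];
    double outstanding_balance;
} billing_customer_t;

/* Shipping context's model of a customer (illustrative fields) */
typedef struct {
    char customer_id[37];
    char street_address[128];
    char preferred_carrier[32];
} shipping_customer_t;

/* Anti-corruption layer: explicit translation at the context boundary.
 * Only the shared identifier crosses; each context fills the rest from
 * its own store, so Billing's fields never leak into Shipping's model. */
static void acl_billing_to_shipping(const billing_customer_t *in,
                                    shipping_customer_t *out) {
    memset(out, 0, sizeof(*out));
    snprintf(out->customer_id, sizeof(out->customer_id), "%s", in->customer_id);
}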
🌱 Modular Monolith First
Before splitting, enforce module boundaries inside the monolith:
  • Each module has a public API (headers/interfaces) — no reaching into internals
  • Modules do not share database tables across boundaries
  • Cross-module calls are synchronous function calls — trivially refactorable to HTTP/gRPC later
  • Modules can be extracted one at a time (Strangler Fig)
If your monolith has clean module boundaries, extracting a service is a lift-and-shift. If it's a big ball of mud, microservices just distribute the mess over a network.
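As a minimal sketch of what such a boundary can look like in a C modular monolith (file layout and names are hypothetical): the Orders module exposes one public header, and other modules may call only what it declares.

/* orders/orders.h — the Orders module's public API (hypothetical layout).
 * Other modules include only this header; orders_internal.h and the
 * module's own tables are off-limits to them. */
#ifndef ORDERS_H
#define ORDERS_H

#include <stdbool.h>

typedef struct {
    char   order_id[37];
    char   user_id[37];
    double total;
} order_summary_t;

/* Cross-module calls are plain function calls today... */
bool orders_place(const char *user_id, const char *cart_id, order_summary_t *out);
bool orders_get(const char *order_id, order_summary_t *out);

/* ...and because callers never touch internals or tables, these signatures
 * can later be backed by an HTTP/gRPC client instead of a local function. */

#endif /* ORDERS_H */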
🪴 Strangler Fig Pattern
Incrementally replace a monolith without a big-bang rewrite:
  1. Identify a bounded context to extract (e.g., Notifications)
  2. Build the new service alongside the monolith
  3. Route specific endpoints (/notify/*) through the API Gateway to the new service
  4. Monolith still handles everything else — both coexist
  5. Once new service is stable, remove the monolith's notification module
  6. Repeat for next module
📐 Phase 6 Module Map
Module | Topic | Key Concepts
M15 (this) | Microservices & Infrastructure | Architecture decisions, API Gateway, circuit breaker, Docker, K8s, CI/CD
M16 | Service Mesh & Advanced Infra | Istio/Envoy, mTLS, traffic shaping, Helm, Terraform IaC
Prerequisites: Ph3 (Auth — JWT validation at the gateway), Ph5 (Event-Driven — async inter-service communication, Outbox pattern)
🔗 Sync vs Async Inter-Service Communication
Dimension | Synchronous (REST/gRPC) | Asynchronous (Events/Queues)
Coupling | Temporal: caller blocks until callee responds | Loose: caller fires and continues
Latency | Fast for simple request/reply | Adds queuing delay (ms–seconds)
Failure propagation | Downstream failure cascades upstream | Broker buffers; caller unaffected if a consumer is down
Consistency | Immediate | Eventual
Observability | Easy: request trace follows the call chain | Harder: events fan out; needs correlation IDs
Best for | Queries, user-facing reads, RPC | Side effects (email, analytics, downstream processing)
Hybrid pattern: Use sync for the user-facing response (place order → return order ID immediately), then async for all side effects (charge payment, send confirmation, update analytics) via events.
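A sketch of that hybrid flow in C — insert_order() and publish_event() are placeholder names for the service's own DB write and event producer (e.g., a Kafka or outbox publisher), not a specific library API:

#include <stdio.h>

/* Placeholders for the service's own persistence and event publishing */
extern int insert_order(const char *user_id, const char *cart_id,
                        char *order_id_out, size_t out_len);      /* sync DB write */
extern int publish_event(const char *topic, const char *json_payload);

/* Sync path: validate + persist, return the order ID to the caller right away.
 * Async path: payment, confirmation email, analytics all react to the event. */
int place_order(const char *user_id, const char *cart_id,
                char *order_id_out, size_t out_len) {
    if (insert_order(user_id, cart_id, order_id_out, out_len) != 0)
        return -1;                      /* user sees an immediate error */

    char event[256];
    snprintf(event, sizeof(event),
             "{\"type\":\"order_placed\",\"order_id\":\"%s\",\"user_id\":\"%s\"}",
             order_id_out, user_id);
    publish_event("orders.events", event);  /* side effects happen asynchronously */
    return 0;                               /* respond with the order ID now */
}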
🌐 REST Design for Microservices
  • Versioned endpoints: /v1/orders — never break existing consumers
  • Idempotency keys on POST: Idempotency-Key: {uuid} header (see the sketch after this list)
  • Pagination: cursor-based over offset (stable under inserts)
  • Timeout headers: Request-Timeout: 5000 — avoid indefinite waits
  • Structured error responses: {"error":"NOT_FOUND","message":"..."}
  • Health endpoints: /health/live (process alive), /health/ready (dependencies healthy)
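A sketch of the idempotency-key item above — idem_lookup()/idem_store() and create_order() are illustrative stand-ins for a Redis- or Postgres-backed key store and the real handler:

#include <stdbool.h>
#include <stddef.h>

/* Illustrative store API: maps an idempotency key to a previously
 * computed response (Redis or Postgres in practice). */
extern bool idem_lookup(const char *key, char *cached_resp, size_t len);
extern void idem_store(const char *key, const char *resp);
extern int  create_order(const char *body, char *resp, size_t len);

/* POST /v1/orders with Idempotency-Key: retrying the same key returns the
 * original response instead of creating a duplicate order. */
int handle_create_order(const char *idem_key, const char *body,
                        char *resp, size_t resp_len) {
    if (idem_key && idem_lookup(idem_key, resp, resp_len))
        return 200;                     /* replay: same response, no new order */

    int status = create_order(body, resp, resp_len);
    if (idem_key && status < 500)
        idem_store(idem_key, resp);     /* remember the result for future retries */
    return status;
}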
⚡ gRPC for Internal Services
gRPC is preferred over REST for internal service-to-service calls:
  • Binary (Protobuf): smaller payload vs JSON, faster serialization
  • Typed contracts: .proto file is the source of truth — no schema drift
  • Streaming: server-side, client-side, and bidirectional streaming
  • HTTP/2: multiplexed connections, header compression
  • Code generation: auto-generated client/server stubs in any language
Use REST for external-facing APIs (browsers, third parties). Use gRPC for internal service mesh.
🚪 API Gateway Responsibilities
The API Gateway is the single entry point for all client traffic. It handles cross-cutting concerns so individual services don't have to:
Responsibility | How
Routing | Path-based: /orders/* → Order Service, /users/* → User Service
Auth offload | Validate JWT at the gateway; forward X-User-Id header to services — services trust the header
Rate limiting | Token bucket per client IP or API key; return 429 Too Many Requests
SSL termination | HTTPS at the gateway; plain HTTP on the internal network (mTLS for higher security)
Request aggregation (BFF) | Backend For Frontend: gateway calls 3 services and merges the response — saves the mobile client 3 round trips
Canary routing | Route 5% of traffic to a new service version by header/cookie — gradual rollout
Observability | Add X-Request-Id header; log request/response at the entry point
🔍 Service Discovery
Services are ephemeral — IPs change when containers restart. Service discovery provides stable addressing.
Client-side discovery (Consul/Eureka)
  • Service registers itself with registry on startup
  • Client queries registry → gets list of healthy instances → client-side load balances (round-robin, etc.)
  • More control but client must implement discovery logic
Server-side discovery (AWS ALB, Kubernetes)
  • Client sends request to load balancer
  • LB queries registry and forwards to healthy instance
  • Client is simple; LB handles all discovery
DNS-based (Kubernetes Services): Kubernetes injects a DNS name for every Service (orders.default.svc.cluster.local). kube-proxy maintains iptables rules that load-balance across healthy Pods. Client just talks to the DNS name — no discovery library needed.
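From the client's side, DNS-based discovery needs nothing beyond an ordinary getaddrinfo() lookup — a minimal sketch (the Service name shown is the usual Kubernetes form; the helper name is ours):

#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Resolve and connect to a Kubernetes Service by its DNS name.
 * kube-proxy / the ClusterIP does the load balancing behind that name. */
int connect_to_service(const char *host, const char *port) {
    struct addrinfo hints, *res, *p;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    int fd = -1;
    for (p = res; p; p = p->ai_next) {
        fd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (fd < 0) continue;
        if (connect(fd, p->ai_addr, p->ai_addrlen) == 0) break;
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;  /* -1 if no address worked */
}

/* Usage: int fd = connect_to_service("orders.default.svc.cluster.local", "80"); */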
🔀 Correlation ID: Tracing Async Request Chains
When a request fans out across services (sync) or events (async), a correlation ID ties all logs together:
/* At the API Gateway: generate a correlation ID if not present */
const char *corr_id = get_header(req, "X-Correlation-Id");
if (!corr_id) corr_id = generate_uuid();
set_header(req, "X-Correlation-Id", corr_id);

/* Each service: propagate to outgoing calls AND log it with every event */
log_info("correlation_id=%s action=order_placed order_id=%s", corr_id, order_id);

/* Each event published to Kafka: embed correlation_id in the message headers */
rd_kafka_header_add(headers, "correlation_id", -1, corr_id, strlen(corr_id));
When debugging a production issue, search all service logs by correlation ID to reconstruct the full request timeline across service boundaries and async event chains.
⚡ Why Resilience Patterns Are Necessary
In a microservices system, any service call can fail or slow down. Without resilience patterns, one slow service causes a cascade failure: upstream services pile up blocked threads waiting for the slow service → thread pool exhaustion → entire system down.

The four primary resilience patterns:
Pattern | Problem Solved
Timeout | Don't wait forever — bound the worst-case latency
Retry | Transient failures (network blips) often self-resolve — retry with backoff
Circuit Breaker | Stop calling a failing service — give it time to recover, fail fast to callers
Bulkhead | Isolate resource pools — one slow service can't exhaust all threads
🔌 Circuit Breaker — State Machine
Named after electrical circuit breakers that trip to prevent damage. Three states:
┌──────────────────────────────────────────────────────────────────┐
│ CLOSED (normal operation)
│ Requests pass through. Count consecutive failures.
│ failure_count >= threshold (e.g. 5 in 10s) → OPEN
└──────────────────────────────────────────────────────────────────┘
        │ failures exceed threshold
        ▼
┌──────────────────────────────────────────────────────────────────┐
│ OPEN (fail fast)
│ ALL requests rejected immediately (no call to downstream).
│ Returns cached/fallback response or error.
│ After timeout (e.g. 30s) → HALF-OPEN
└──────────────────────────────────────────────────────────────────┘
        │ recovery timeout elapsed
        ▼
┌──────────────────────────────────────────────────────────────────┐
│ HALF-OPEN (probe)
│ Allow N probe requests through.
│ All probes succeed → CLOSED
│ Any probe fails → OPEN (reset timer)
└──────────────────────────────────────────────────────────────────┘
🔑 Circuit Breaker: Key Configuration Parameters
Parameter | What It Controls | Guidance
failure_threshold | N failures to trip OPEN | 5–10 over a rolling window (not total)
failure_rate_threshold | % failure rate to trip (more robust than a count) | 50% failure rate over the last 20 requests
open_timeout | How long to stay OPEN before probing | 30s–60s, or exponential backoff
half_open_max_calls | Max probe calls in HALF-OPEN | 1–3 probes; don't flood the recovering service
slow_call_threshold | Calls slower than N ms count as failures | Set to 2× normal p99 latency
fallback | What to return in OPEN state | Cached response, degraded response, or structured error
Don't set timeouts too generously. If the breaker's slow_call_threshold is 30s but your HTTP client timeout is 60s, threads still block up to 60s on a hung call before it can even be counted as slow. Always set the HTTP timeout ≤ the circuit breaker's slow_call_threshold.
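As a small illustration, assuming libcurl as the HTTP client (the constant name is ours): bound every request by the breaker's slow-call threshold, so a hung downstream can never hold a thread longer than the breaker's own definition of "too slow".

#include <curl/curl.h>

#define SLOW_CALL_THRESHOLD_MS 2000L   /* must match the breaker's configuration */

/* Per-request timeouts bounded by the breaker's slow-call threshold */
static void configure_timeouts(CURL *curl) {
    curl_easy_setopt(curl, CURLOPT_CONNECTTIMEOUT_MS, 500L);
    curl_easy_setopt(curl, CURLOPT_TIMEOUT_MS, SLOW_CALL_THRESHOLD_MS);
    /* A CURLE_OPERATION_TIMEDOUT result is then recorded as a breaker failure. */
}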
🚢 Bulkhead Pattern
Named after the watertight compartments in a ship — if one compartment floods, others are sealed off and the ship survives.

In microservices: instead of one shared thread pool for all downstream calls, create separate thread pools per dependency:
/* Thread pool bulkhead: separate pool per downstream service */
typedef struct {
    pthread_t    threads[POOL_SIZE];
    work_queue_t queue;
    const char  *name;        /* e.g. "payment-service" */
    int          max_queue;   /* reject if queue full */
} bulkhead_pool_t;

/* Separate pools: payment can be slow without blocking inventory calls */
bulkhead_pool_t payment_pool   = { .name = "payment",   .max_queue = 50  };
bulkhead_pool_t inventory_pool = { .name = "inventory", .max_queue = 200 };
If Payment Service slows down and fills the payment pool queue, the system returns 503 Service Unavailable for payment calls only. Inventory calls proceed normally — the bulkhead contains the failure.
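A sketch of the submission path for those pools — it reuses bulkhead_pool_t from above, while work_item_t and the work_queue_* helpers are placeholders for whatever queue API the service uses:

/* Reject instead of blocking when a dependency's queue is full. */
typedef struct work_item work_item_t;                 /* placeholder job type */
extern int  work_queue_len(work_queue_t *q);          /* placeholder queue API */
extern void work_queue_push(work_queue_t *q, work_item_t *item);

typedef enum { BH_ACCEPTED = 0, BH_REJECTED = -1 } bh_result_t;

bh_result_t bulkhead_submit(bulkhead_pool_t *pool, work_item_t *item) {
    if (work_queue_len(&pool->queue) >= pool->max_queue) {
        /* This dependency is saturated: caller maps BH_REJECTED to 503
         * for this dependency only; other pools keep serving normally. */
        return BH_REJECTED;
    }
    work_queue_push(&pool->queue, item);  /* a pool thread will pick it up */
    return BH_ACCEPTED;
}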
🔁 Retry with Exponential Backoff + Jitter
Retries are essential for transient failures, but naive retries can cause thundering herd — hundreds of services all retry at the same second and overwhelm the recovering service.

Solution: exponential backoff + random jitter:
#include <stdlib.h>   /* rand */
#include <unistd.h>   /* usleep */

/* Retry with exponential backoff and full jitter */
int retry_with_backoff(int (*fn)(void *), void *ctx, int max_attempts, int base_ms) {
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (fn(ctx) == 0) return 0;            /* success */
        if (attempt + 1 == max_attempts) break;

        /* Exponential: base_ms * 2^attempt, capped at 30s */
        int cap = base_ms * (1 << attempt);
        if (cap > 30000) cap = 30000;

        /* Full jitter: random in [0, cap] — spreads retries */
        int delay = rand() % (cap + 1);
        usleep(delay * 1000);
    }
    return -1;  /* all attempts failed */
}

/* Usage: retry up to 5 times, starting at 100ms base delay */
retry_with_backoff(call_payment_service, &ctx, 5, 100);
Only retry idempotent operations. Never blindly retry a POST that creates a resource — you'll create duplicates. Use idempotency keys (M13) to make POSTs safe to retry.
🐳 Multi-Stage Docker Build for C/C++
A C binary compiled in a full build image can run in a minimal runtime image. Multi-stage builds separate compilation from runtime, dramatically reducing image size — a full gcc build image weighs in at over 1 GB, while a slim runtime image carrying only the needed shared libraries is typically an order of magnitude smaller:
📄 Dockerfile — Multi-Stage C/C++ Build
# Stage 1: Builder — compile the binary
FROM gcc:13-bookworm AS builder
WORKDIR /build

# Install only build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
        librdkafka-dev \
        libssl-dev \
        libpq-dev \
        cmake \
    && rm -rf /var/lib/apt/lists/*

# Copy source
COPY . .

# Compile — statically link where possible for a portable binary
RUN cmake -DCMAKE_BUILD_TYPE=Release -B build . \
    && cmake --build build --target order_service -j$(nproc)

############################################################
# Stage 2: Runtime — minimal image, just the binary
FROM debian:bookworm-slim

# Install only runtime libraries (no compilers, headers, or build tools)
RUN apt-get update && apt-get install -y --no-install-recommends \
        librdkafka1 \
        libssl3 \
        libpq5 \
        ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Security: run as non-root
RUN useradd -r -s /bin/false appuser
USER appuser
WORKDIR /app

# Copy only the compiled binary from the builder stage
COPY --from=builder /build/build/order_service /app/order_service

# ENTRYPOINT: exec form — PID 1 gets signals properly (SIGTERM for graceful shutdown)
ENTRYPOINT ["/app/order_service"]

# Default arguments (overridable at runtime)
CMD ["--port=8080"]
✅ Docker Best Practices
  • Pin image versions: debian:bookworm-slim not debian:latest — reproducible builds
  • Non-root user: useradd -r + USER appuser — container escape with root = host root
  • No secrets in image: use environment variables or secrets mounts, never ARG PASSWORD (visible in layers)
  • COPY specific files: COPY src/ /build/src/ not COPY . . — avoids copying .git, local configs
  • Read-only root filesystem: --read-only flag — forces explicit volume mounts for writable paths
  • Health check: HEALTHCHECK CMD curl -f http://localhost:8080/health/live || exit 1
🚫 Common Docker Mistakes
  • Running as root (default if no USER set)
  • Using latest tag — non-deterministic; breaks reproducibility
  • Building in a single stage — final image carries GCC, headers, build tools
  • Putting secrets in environment variables that get logged
  • Using shell form instead of exec form — /bin/sh becomes PID 1, so docker stop's SIGTERM never reaches your process
  • apt-get update without && apt-get install in same RUN — stale layer cache
  • Not adding a .dockerignore — copies node_modules/, .git/, build artifacts
📄 .dockerignore
# .dockerignore — keep build context small and clean
.git
.gitignore
.github
build/
*.o
*.a
*.so
cmake-build-debug/
CMakeCache.txt
CMakeFiles/
.env
.env.*
*.md
docs/
tests/
Every byte in the build context is sent to the Docker daemon. Large build contexts (accidental .git inclusion) slow down every build. A good .dockerignore is as important as the Dockerfile itself.
🔄 ENTRYPOINT vs CMD: Graceful Shutdown
When Kubernetes sends SIGTERM (graceful shutdown), the signal goes to PID 1 in the container. If your process is not PID 1, it never receives SIGTERM and gets hard-killed after terminationGracePeriodSeconds.

Form | Shell wrapper | PID 1 | Gets SIGTERM?
ENTRYPOINT ["/app/service"] (exec) | No | Your binary | ✅ Yes
ENTRYPOINT /app/service (shell) | /bin/sh -c | sh | ❌ No (sh is PID 1)
CMD ["/app/service"] (exec) | No | Your binary | ✅ Yes (if no ENTRYPOINT)
Always use exec form: ENTRYPOINT ["/app/service"]. In your C process, register a SIGTERM handler that drains connections and exits cleanly.
☸️ Kubernetes Core Objects
Object | Purpose | Key Fields
Pod | Smallest deployable unit: one or more containers sharing network/storage | spec.containers[].image, resources, env
Deployment | Declares desired state: N replicas of a Pod template; manages rolling updates and rollback | spec.replicas, spec.strategy, spec.template
Service | Stable DNS name + ClusterIP that load-balances across matching Pods (by label selector) | spec.selector, spec.ports, spec.type
Ingress | HTTP/S routing rules: hostname/path → Service; TLS termination | spec.rules[].host, spec.tls
ConfigMap | Non-sensitive configuration: mounted as env vars or files | data key-value pairs
Secret | Sensitive data (passwords, tokens): base64-encoded; encrypted at rest only if etcd encryption is enabled | data (base64), type
HPA | Horizontal Pod Autoscaler: scales replicas based on CPU/memory/custom metrics | spec.minReplicas, spec.maxReplicas, spec.metrics
📄 Kubernetes Deployment — C Service Example
# order-service Deployment + Service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  labels:
    app: order-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most 1 Pod down during update
      maxSurge: 1         # at most 1 extra Pod during update
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: registry.example.com/order-service:1.4.2
          ports:
            - containerPort: 8080
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: order-service-secrets
                  key: database_url
            - name: KAFKA_BROKERS
              valueFrom:
                configMapKeyRef:
                  name: order-service-config
                  key: kafka_brokers
          resources:
            requests:
              cpu: "100m"      # 0.1 CPU cores guaranteed
              memory: "64Mi"
            limits:
              cpu: "500m"      # burst up to 0.5 CPU
              memory: "256Mi"  # OOM-killed if exceeded
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3   # restart after 3 consecutive failures
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 3
            periodSeconds: 5
            failureThreshold: 2   # remove from LB after 2 failures
      terminationGracePeriodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: order-service
spec:
  selector:
    app: order-service
  ports:
    - port: 80
      targetPort: 8080
💓 Liveness vs Readiness Probes
Probe | Failure Action | Purpose
Liveness | Restart the container | Is the process alive? (Detects deadlocks, infinite loops)
Readiness | Remove from Service endpoints (stops traffic) | Is the process ready to serve? (DB connected, cache warm)
Startup | Restart if not started within the window | Slow-starting apps — liveness is disabled until the startup probe succeeds
Never make readiness probe depend on external services. If Payment Service is down, you don't want all Order Service pods removed from load balancing — implement graceful degradation instead.
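A sketch of a readiness check that follows this rule — PQstatus() is the real libpq call; the payment_degraded flag is an illustrative hook fed by the circuit breaker:

#include <libpq-fe.h>
#include <stdbool.h>

extern PGconn *pg_conn;                 /* this service's own database handle */
extern volatile int payment_degraded;   /* illustrative flag set by the payment circuit breaker */

/* Readiness reflects only what THIS pod needs in order to serve at all. */
bool is_ready(void) {
    return pg_conn != NULL && PQstatus(pg_conn) == CONNECTION_OK;
}

/* Downstream trouble (e.g. Payment Service down) degrades responses instead:
 * reporting NotReady here would pull every Order pod out of the load balancer. */
bool is_degraded(void) {
    return payment_degraded;
}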
🚀 Rolling Updates & Rollback
  • Update image: kubectl set image deployment/order-service order-service=registry.example.com/order-service:1.4.3
  • Monitor rollout: kubectl rollout status deployment/order-service
  • Rollback to previous: kubectl rollout undo deployment/order-service
  • Rollback to specific revision: kubectl rollout undo deployment/order-service --to-revision=2
Pod Disruption Budget: minAvailable: 2 — node drains and cluster-autoscaler evictions respect this; voluntary disruptions never take down so many pods that fewer than 2 remain available.
🔄 CI/CD Pipeline Stages
A complete pipeline runs on every commit and gates production deployment behind automated quality checks:
  1. Lint & Static Analysis — clang-tidy, cppcheck, clang-format check. Fail fast: bad code never reaches tests. (~30s)
  2. Unit Tests — fast, isolated tests with mocked dependencies. Target: >80% coverage on core business logic. (~2m)
  3. Integration Tests — spin up Postgres, Kafka, Redis via Docker Compose; test real service behavior against real dependencies. (~5m)
  4. Security Scan — Trivy scans for CVEs in the base image and dependencies; Semgrep for security antipatterns in code. Block on HIGH/CRITICAL CVEs.
  5. Build OCI Image — multi-stage Docker build. Tag with the git SHA: registry/service:abc1234. SHA tags are immutable — never use :latest in production.
  6. Push to Registry — push to the container registry. Sign the image with cosign for supply chain security.
  7. Deploy to Staging — kubectl set image or Helm upgrade. Run smoke tests against the staging URL.
  8. Deploy to Production — manual approval gate (or auto on green staging). Blue-green or canary rollout. Monitor error rate + latency for 10 minutes.
🟢🔵 Blue-Green Deployment
Maintain two identical environments (blue = current, green = new):
  1. Deploy new version to green environment
  2. Run smoke tests on green (not receiving production traffic)
  3. Switch load balancer to point to green (instant cutover)
  4. Blue environment kept running for instant rollback
  5. After confidence period, decommission blue
Pros: zero-downtime, instant rollback
Cons: requires 2× infrastructure during transition
🐦 Canary Deployment
Route a small percentage of traffic to new version first:
  1. Deploy new version alongside old; route 5% of traffic to it
  2. Monitor error rate, latency, business metrics (conversion rate)
  3. If healthy after 10m: increase to 20% → 50% → 100%
  4. If issues: instant rollback by routing 100% back to old version
Pros: real production traffic validation, minimal blast radius
Cons: two versions run simultaneously — must be API-compatible
📄 GitHub Actions Example — C Service CI Pipeline
# .github/workflows/ci.yml
name: CI
on:
  push:
    branches: [main]
  pull_request:

jobs:
  build-test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: test
          POSTGRES_DB: testdb
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y libpq-dev librdkafka-dev clang-tidy cppcheck
      - name: Configure
        run: cmake -B build -DCMAKE_BUILD_TYPE=Debug -DENABLE_TESTS=ON
      - name: Build
        run: cmake --build build -j$(nproc)
      - name: Lint
        run: clang-tidy src/*.c -- -Iinclude
      - name: Unit tests
        run: ./build/tests/unit_tests
      - name: Integration tests
        env:
          DATABASE_URL: postgres://postgres:test@localhost/testdb
        run: ./build/tests/integration_tests

  docker-build:
    needs: build-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image
        run: |
          docker build -t registry.example.com/order-service:${{ github.sha }} .
          docker push registry.example.com/order-service:${{ github.sha }}
      - name: Deploy to staging
        run: |
          kubectl set image deployment/order-service \
            order-service=registry.example.com/order-service:${{ github.sha }}
          kubectl rollout status deployment/order-service --timeout=5m
📋 The 12-Factor App Methodology
A methodology for building software-as-a-service apps that are portable, scalable, and maintainable. Originally from Heroku — now the standard for cloud-native services.
🔑 The 12 Factors (Microservices-Relevant Highlights)
# | Factor | Rule | C Implementation
1 | Codebase | One codebase per service, tracked in version control | One git repo per service; main deploys to production
2 | Dependencies | Declare and isolate all dependencies explicitly | CMakeLists.txt pins exact library versions; no implicit system libraries
3 | Config | Config in the environment, not in code | getenv("DATABASE_URL") — never hardcode DSNs/passwords
4 | Backing Services | Treat DB, cache, broker as attached resources | URL from env — swap Postgres for RDS without code changes
6 | Processes | Execute the app as one or more stateless processes | No in-process session state; sessions in Redis
7 | Port Binding | Export services via port binding, not app-server injection | Service binds $PORT itself; Kubernetes routes to it
8 | Concurrency | Scale out via the process model | Multiple replicas (K8s replicas: N), not threads-per-monolith
9 | Disposability | Maximize robustness with fast startup and graceful shutdown | Handle SIGTERM: drain connections, flush buffers, exit 0
11 | Logs | Treat logs as event streams — write to stdout | fprintf(stdout, "...") — never write to files inside the container
12 | Admin Processes | Run admin/management tasks as one-off processes | DB migrations as a separate Kubernetes Job, not in service startup
⚙️ Factor 3: Config via Environment
/* Never do this: */
const char *db_url = "postgres://prod-db:5432/app";

/* Do this: */
const char *db_url = getenv("DATABASE_URL");
if (!db_url) {
    fprintf(stderr, "DATABASE_URL not set\n");
    exit(1);
}
Config that varies between environments (dev/staging/prod) must never be in code. The same binary runs in all environments — only the environment variables differ.
🔄 Factor 9: Graceful Shutdown in C
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t shutting_down = 0;

static void handle_sigterm(int sig) {
    (void)sig;
    shutting_down = 1;
}

int main(void) {
    signal(SIGTERM, handle_sigterm);
    signal(SIGINT, handle_sigterm);

    while (!shutting_down) {
        /* serve requests */
    }

    /* Graceful shutdown: drain connections, flush, close.
     * rk (Kafka producer) and pg_conn (Postgres connection) were created at startup. */
    drain_active_connections();
    rd_kafka_flush(rk, 10000);   /* flush pending events */
    PQfinish(pg_conn);
    fprintf(stdout, "Shutdown complete\n");
    return 0;
}
📊 Factor 11: Structured Logs to stdout
Write logs as structured JSON to stdout. The container runtime captures stdout and forwards to your log aggregation platform (ELK, Loki, Datadog).
/* Structured JSON logging */
#define LOG_INFO(fmt, ...) \
    fprintf(stdout, \
            "{\"level\":\"INFO\",\"ts\":\"%.3f\",\"msg\":\"" fmt "\"}\n", \
            get_unix_ms(), ##__VA_ARGS__)

/* Usage */
LOG_INFO("order_placed order_id=%s user_id=%s amount=%.2f",
         order_id, user_id, amount);
Include in every log line: timestamp, level, service, correlation_id, and the event. This makes logs searchable and correlatable across services in your log aggregator.
── Implementation 1 — Circuit Breaker (Thread-Safe, C11 Atomics) ──
🔌 Circuit Breaker in C (stdatomic, three-state machine)
/* circuit_breaker.h — thread-safe circuit breaker */
#pragma once
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

typedef enum { CB_CLOSED, CB_OPEN, CB_HALF_OPEN } cb_state_t;

typedef struct {
    _Atomic(int)  state;          /* cb_state_t */
    _Atomic(int)  failure_count;
    _Atomic(long) open_since_ms;  /* epoch ms when opened */
    int  failure_threshold;
    long open_timeout_ms;
} circuit_breaker_t;

static inline long now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000LL + ts.tv_nsec / 1000000LL;
}

static inline void cb_init(circuit_breaker_t *cb, int threshold, long timeout_ms) {
    atomic_store(&cb->state, CB_CLOSED);
    atomic_store(&cb->failure_count, 0);
    atomic_store(&cb->open_since_ms, 0);
    cb->failure_threshold = threshold;
    cb->open_timeout_ms = timeout_ms;
}

/* Returns true if the call should be allowed through */
static inline bool cb_allow(circuit_breaker_t *cb) {
    int state = atomic_load(&cb->state);
    if (state == CB_CLOSED) return true;

    if (state == CB_OPEN) {
        long elapsed = now_ms() - atomic_load(&cb->open_since_ms);
        if (elapsed >= cb->open_timeout_ms) {
            /* Transition to HALF_OPEN to probe recovery */
            int expected = CB_OPEN;
            if (atomic_compare_exchange_strong(&cb->state, &expected, CB_HALF_OPEN)) {
                return true;  /* this thread gets the probe request */
            }
        }
        return false;  /* still open */
    }

    /* HALF_OPEN: let probes through (a production version would cap concurrent probes) */
    return true;
}

static inline void cb_on_success(circuit_breaker_t *cb) {
    int state = atomic_load(&cb->state);
    if (state == CB_HALF_OPEN) {
        /* Recovery confirmed: close the breaker */
        atomic_store(&cb->failure_count, 0);
        atomic_store(&cb->state, CB_CLOSED);
    }
    if (state == CB_CLOSED) {
        /* Reset failure count on success */
        atomic_store(&cb->failure_count, 0);
    }
}

static inline void cb_on_failure(circuit_breaker_t *cb) {
    int state = atomic_load(&cb->state);
    if (state == CB_HALF_OPEN) {
        /* Probe failed: reopen the breaker */
        atomic_store(&cb->open_since_ms, now_ms());
        atomic_store(&cb->state, CB_OPEN);
        return;
    }
    int count = atomic_fetch_add(&cb->failure_count, 1) + 1;
    if (count >= cb->failure_threshold) {
        int expected = CB_CLOSED;
        if (atomic_compare_exchange_strong(&cb->state, &expected, CB_OPEN)) {
            atomic_store(&cb->open_since_ms, now_ms());
            fprintf(stderr, "[CB] Circuit OPENED after %d failures\n", count);
        }
    }
}

/* Usage */
int call_payment_service(void *ctx) { (void)ctx; return 0; }  /* placeholder */
circuit_breaker_t payment_cb;

int charge_customer(const char *order_id, double amount) {
    (void)order_id; (void)amount;
    if (!cb_allow(&payment_cb)) {
        fprintf(stderr, "[CB] OPEN: payment service unavailable\n");
        return -1;  /* fail fast */
    }
    int result = call_payment_service(NULL);
    if (result == 0) cb_on_success(&payment_cb);
    else             cb_on_failure(&payment_cb);
    return result;
}
── Implementation 2 — Health Check HTTP Server ──
💓 Minimal Health Check HTTP Server (POSIX sockets)
/* health.c — minimal HTTP health check endpoint for Kubernetes probes */
#include <sys/socket.h>
#include <netinet/in.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>

static volatile int ready = 0;  /* set to 1 once DB connected etc. */

static void *health_thread(void *arg) {
    (void)arg;
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    int opt = 1;
    setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_addr.s_addr = INADDR_ANY,
        .sin_port = htons(8081)
    };
    bind(srv, (struct sockaddr *)&addr, sizeof(addr));
    listen(srv, 16);

    while (1) {
        int conn = accept(srv, NULL, NULL);
        if (conn < 0) continue;

        char buf[256];
        ssize_t n = recv(conn, buf, sizeof(buf) - 1, 0);
        buf[n > 0 ? n : 0] = '\0';

        const char *resp;
        if (strstr(buf, "GET /health/live")) {
            resp = "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nOK";
        } else if (strstr(buf, "GET /health/ready")) {
            resp = ready
                ? "HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nREADY"
                : "HTTP/1.1 503 Service Unavailable\r\nContent-Length: 13\r\n\r\nNOT_READY_YET";
        } else {
            resp = "HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\n\r\n";
        }
        send(conn, resp, strlen(resp), 0);
        close(conn);
    }
    return NULL;
}

void start_health_server(void) {
    pthread_t t;
    pthread_create(&t, NULL, health_thread, NULL);
    pthread_detach(t);
}

void set_ready(int r) { ready = r; }
Start the health server before connecting to databases so the /health/live probe succeeds immediately. Set ready=1 only after all dependencies (DB, Kafka) are connected — this keeps the pod out of the Service load balancer until it's actually ready.
🔬 Lab 1 — Circuit Breaker Under Load
Observe circuit breaker state transitions under real failure conditions.
1 Build the circuit breaker from Implementation 1. Write a test harness that calls a "downstream service" function that returns success/failure based on a configurable failure rate.
2 Run 100 concurrent threads (pthreads) calling the circuit breaker simultaneously. Set the failure rate to 80%.
3 Observe and log state transitions: CLOSED → OPEN (trip after N failures) → HALF-OPEN (after timeout) → CLOSED (after probe success).
4 Measure: in OPEN state, what is the p99 response time? (Should be microseconds — fail fast.) Compare to CLOSED state with real downstream calls.
5 Bonus: add a sliding window failure rate threshold (failure rate over last 20 calls, not just a count) and verify it's more resilient to bursty failures.
🔬 Lab 2 — Docker Multi-Stage Build: Size Comparison
Demonstrate the image size impact of multi-stage builds.
1 Write a simple C HTTP server (or use the health server from Implementation 2). Compile it manually to confirm it works.
2 Write a single-stage Dockerfile using FROM gcc:13. Build it with docker build -t service:single-stage . and check the size with docker image ls service:single-stage.
3 Write the multi-stage Dockerfile from the Docker section above. Build it with docker build -t service:multi-stage . and compare sizes.
4 Run docker run --rm service:multi-stage. Verify the binary executes correctly in the slim image.
5 Run docker history service:multi-stage — verify no build tools (gcc, make) appear in any layer of the final image.
6 Run docker run --user $(id -u) service:multi-stage — verify non-root execution. Check process inside container: docker exec <id> id.
🔬 Lab 3 — Kubernetes Deployment with Probes
Deploy the C service to a local Kubernetes cluster (minikube or kind).
1 Load the Docker image into the cluster: minikube image load service:multi-stage.
2 Apply the Deployment manifest from the Kubernetes section above. Watch pods come up: kubectl get pods -w.
3 Observe readiness probe in action: modify the service to delay setting ready=1 by 10 seconds. Watch the pod stay NotReady during startup.
4 Trigger a liveness probe failure: modify the /health/live endpoint to return 503 after receiving 5 requests. Observe Kubernetes restart the pod.
5 Perform a rolling update: rebuild with a different version tag, apply the new image. Watch rolling update: kubectl rollout status deployment/order-service.
6 Rollback: kubectl rollout undo deployment/order-service. Verify the previous image is running.
🔬 Lab 4 — Strangler Fig Migration (Simulated)
Simulate extracting a service from a monolith using Strangler Fig.
1 Write a "monolith": a C HTTP server handling /orders/*, /users/*, and /notifications/* all in one process.
2 Write a new "Notifications microservice": a separate C process handling /notifications/*.
3 Add an Nginx reverse proxy as the "API Gateway": route /notifications/* to the new service, all other paths to the monolith.
4 Verify: requests to /orders/123 hit the monolith. Requests to /notifications/send hit the new service. Both return correct responses.
5 Remove the notifications handler from the monolith. Verify all notification requests still work (now served entirely by new service).
── Phase 6 Mastery Checklist ──
Architecture
  • Explain Conway's Law and how it drives service boundaries
  • Describe the Strangler Fig migration pattern step by step
  • Compare sync REST/gRPC vs async events for inter-service communication
  • List 5 responsibilities of an API Gateway
  • Explain client-side vs server-side service discovery
Resilience
  • Draw the circuit breaker state machine (CLOSED/OPEN/HALF-OPEN)
  • Implement retry with exponential backoff + jitter
  • Explain the bulkhead pattern and when to apply it
  • Set correct timeout values relative to circuit breaker thresholds
Docker & Kubernetes
  • Write a multi-stage Dockerfile for a C binary
  • Explain why non-root + exec-form ENTRYPOINT matters
  • Write a Deployment with liveness + readiness probes
  • Explain the difference between liveness and readiness probes
  • Perform a rolling update and rollback with kubectl
CI/CD & 12-Factor
  • List the 8 stages of a production CI/CD pipeline
  • Explain blue-green vs canary deployment trade-offs
  • Apply 12-Factor principles: config in env, logs to stdout, graceful shutdown
  • Write structured JSON logging and graceful SIGTERM handling in C