M15 — Microservices & Infrastructure
Phase 6
Service architecture decisions · API Gateway & service discovery · Circuit breaker & bulkhead · Docker multi-stage builds · Kubernetes fundamentals · CI/CD pipelines · 12-Factor App
🏗️ Monolith vs Microservices: The Real Decision
The default answer is start with a monolith — specifically a modular monolith with clean internal boundaries. Split only when you have a concrete reason to, not because microservices are trendy.
When microservices make sense:
- Team topology: Conway's Law — your system architecture mirrors your communication structure. If you have 5 independent teams, a monolith creates coordination overhead; separate services let teams deploy independently.
- Independent scaling: one component (e.g., image processing) needs 10× more resources than others — split it to scale independently
- Technology heterogeneity: ML model serving needs Python, low-latency trading needs C — different services, different stacks
- Fault isolation: a crash in recommendations shouldn't crash checkout
The costs you accept:
- Network latency and reliability risk in every inter-service call
- Distributed tracing, log aggregation, and health monitoring for N services
- Data consistency without distributed transactions (Saga, Outbox)
- Deployment pipeline for each service
Analogy — Bounded Contexts (DDD):
In an e-commerce domain, "Customer" means something different to the Billing context (credit card, payment history) vs the Shipping context (address, preferred carrier). Each bounded context defines its own model of "Customer" — and each maps to a microservice boundary. Crossing context boundaries requires an explicit translation (anti-corruption layer).
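To make the translation concrete, here is a minimal sketch in C with hypothetical billing_customer_t / shipping_customer_t types: only the shared identity crosses the boundary, and each context keeps its own model.
/* anti_corruption.c — illustrative sketch; type and field names are hypothetical */
#include <string.h>

typedef struct {                    /* Billing context's "Customer" */
    char customer_id[37];
    char card_token[65];
    double lifetime_spend;
} billing_customer_t;

typedef struct {                    /* Shipping context's "Customer" */
    char customer_id[37];
    char delivery_address[256];
    char preferred_carrier[32];
} shipping_customer_t;

/* Anti-corruption layer: translate at the boundary instead of letting
 * Billing's model leak into Shipping's code. */
void acl_customer_from_billing(const billing_customer_t *src,
                               shipping_customer_t *dst) {
    memset(dst, 0, sizeof(*dst));
    strncpy(dst->customer_id, src->customer_id, sizeof(dst->customer_id) - 1);
    /* card_token and lifetime_spend deliberately do not cross the boundary;
     * Shipping fills address/carrier from its own data. */
}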
🌱 Modular Monolith First
Before splitting, enforce module boundaries inside the monolith:
- Each module has a public API (headers/interfaces) — no reaching into internals
- Modules do not share database tables across boundaries
- Cross-module calls are synchronous function calls — trivially refactorable to HTTP/gRPC later
- Modules can be extracted one at a time (Strangler Fig)
If your monolith has clean module boundaries, extracting a service is a lift-and-shift. If it's a big ball of mud, microservices just distribute the mess over a network.
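In C terms, a module boundary can be as simple as a public header plus an opaque handle — a hypothetical sketch for a notifications module:
/* notifications.h — the module's PUBLIC API; callers include only this header.
 * Internals (queues, templates, SMTP client) stay in the module's .c files. */
#pragma once

typedef struct notifications notifications_t;    /* opaque handle */

notifications_t *notifications_create(const char *smtp_url);
int  notifications_send(notifications_t *n, const char *user_id,
                        const char *template_id);
void notifications_destroy(notifications_t *n);
Because callers depend only on these functions, extracting the module later means replacing the implementation behind them with an HTTP/gRPC client — call sites don't change.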
🪴 Strangler Fig Pattern
Incrementally replace a monolith without a big-bang rewrite:
- Identify a bounded context to extract (e.g., Notifications)
- Build the new service alongside the monolith
- Route specific endpoints (/notify/*) through the API Gateway to the new service
- Monolith still handles everything else — both coexist
- Once new service is stable, remove the monolith's notification module
- Repeat for next module
📐 Phase 6 Module Map
| Module | Topic | Key Concepts |
|---|---|---|
| M15 (this) | Microservices & Infrastructure | Architecture decisions, API Gateway, circuit breaker, Docker, K8s, CI/CD |
| M16 | Service Mesh & Advanced Infra | Istio/Envoy, mTLS, traffic shaping, Helm, Terraform IaC |
Prerequisites: Ph3 (Auth — JWT validation at the gateway), Ph5 (Event-Driven — async inter-service communication, Outbox pattern)
🔗 Sync vs Async Inter-Service Communication
| Dimension | Synchronous (REST/gRPC) | Asynchronous (Events/Queues) |
|---|---|---|
| Coupling | Temporal: caller blocks until callee responds | Loose: caller fires and continues |
| Latency | Fast for simple request/reply | Adds queuing delay (ms–seconds) |
| Failure propagation | Downstream failure cascades upstream | Broker buffers; caller unaffected by consumer down |
| Consistency | Immediate | Eventual |
| Observability | Easy: request trace follows call chain | Harder: events fan out; need correlation IDs |
| Best for | Queries, user-facing reads, RPC | Side effects (email, analytics, downstream processing) |
Hybrid pattern: Use sync for the user-facing response (place order → return order ID immediately), then async for all side effects (charge payment, send confirmation, update analytics) via events.
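A minimal sketch of the hybrid pattern, with stubbed helpers (db_insert_order and publish_event are hypothetical names): the handler does the synchronous write and responds immediately; everything else reacts to the event.
/* Hybrid sync/async order placement — illustrative sketch */
#include <stdio.h>

static int db_insert_order(const char *body, char *order_id, size_t len) {
    (void)body; snprintf(order_id, len, "ord-123"); return 0;     /* stub */
}
static void publish_event(const char *topic, const char *order_id) {
    printf("async event: %s order_id=%s\n", topic, order_id);     /* stub */
}

int handle_place_order(const char *request_body, char *response, size_t len) {
    char order_id[37];
    /* Synchronous part: persist the order and answer the caller right away */
    if (db_insert_order(request_body, order_id, sizeof(order_id)) != 0)
        return -1;                                  /* maps to a 5xx response */
    snprintf(response, len, "{\"order_id\":\"%s\"}", order_id);

    /* Asynchronous part: payment, confirmation email, analytics all react
     * to this event later (ideally written via the Outbox pattern). */
    publish_event("order.placed", order_id);
    return 0;                                       /* maps to 201 Created */
}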
🌐 REST Design for Microservices
- Versioned endpoints: /v1/orders — never break existing consumers
- Idempotency keys on POST: Idempotency-Key: {uuid} header
- Pagination: cursor-based over offset (stable under inserts)
- Timeout headers: Request-Timeout: 5000 — avoid indefinite waits
- Structured error responses: {"error":"NOT_FOUND","message":"..."}
- Health endpoints: /health/live (process alive), /health/ready (dependencies healthy)
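To illustrate why cursor-based pagination stays stable under inserts, compare the two query shapes (table and column names are hypothetical):
/* Offset pagination: new rows shift every later page, so page 2 can
 * repeat or skip orders inserted between requests. */
const char *offset_query =
    "SELECT id, total FROM orders ORDER BY id LIMIT 50 OFFSET 100";

/* Cursor pagination: the client resends the last id it saw; "after id $1"
 * means the same thing no matter how many rows were inserted since. */
const char *cursor_query =
    "SELECT id, total FROM orders WHERE id > $1 ORDER BY id LIMIT 50";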
⚡ gRPC for Internal Services
gRPC is preferred over REST for internal service-to-service calls:
- Binary (Protobuf): smaller payload vs JSON, faster serialization
- Typed contracts: the .proto file is the source of truth — no schema drift
- Streaming: server-side, client-side, and bidirectional streaming
- HTTP/2: multiplexed connections, header compression
- Code generation: auto-generated client/server stubs in any language
Use REST for external-facing APIs (browsers, third parties). Use gRPC for internal service mesh.
🚪 API Gateway Responsibilities
The API Gateway is the single entry point for all client traffic. It handles cross-cutting concerns so individual services don't have to:
| Responsibility | How |
|---|---|
| Routing | Path-based: /orders/* → Order Service, /users/* → User Service |
| Auth offload | Validate JWT at gateway; forward X-User-Id header to services — services trust the header |
| Rate limiting | Token bucket per client IP or API key; return 429 Too Many Requests |
| SSL termination | HTTPS at gateway; plain HTTP on internal network (mTLS for higher security) |
| Request aggregation (BFF) | Backend For Frontend: gateway calls 3 services and merges response — saves mobile client from 3 round trips |
| Canary routing | Route 5% of traffic to new service version by header/cookie — gradual rollout |
| Observability | Add X-Request-Id header; log request/response at entry point |
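As a sketch of the rate-limiting row above — a token bucket per client, refilled continuously; when the bucket is empty the gateway returns 429 (per-client lookup and locking omitted):
/* token_bucket.c — minimal token bucket for gateway rate limiting */
#include <stdbool.h>
#include <time.h>

typedef struct {
    double tokens;            /* tokens currently available */
    double rate;              /* refill rate, tokens per second */
    double burst;             /* bucket capacity */
    struct timespec last;     /* last refill time */
} token_bucket_t;

static double seconds_between(struct timespec a, struct timespec b) {
    return (double)(b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Returns true if the request may proceed; false -> respond 429 */
bool bucket_allow(token_bucket_t *tb) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    tb->tokens += tb->rate * seconds_between(tb->last, now);
    if (tb->tokens > tb->burst) tb->tokens = tb->burst;
    tb->last = now;
    if (tb->tokens < 1.0) return false;
    tb->tokens -= 1.0;
    return true;
}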
🔍 Service Discovery
Services are ephemeral — IPs change when containers restart. Service discovery provides stable addressing.
Client-side discovery (Consul/Eureka)
- Service registers itself with registry on startup
- Client queries registry → gets list of healthy instances → client-side load balances (round-robin, etc.)
- More control but client must implement discovery logic
Server-side discovery (AWS ALB, Kubernetes)
- Client sends request to load balancer
- LB queries registry and forwards to healthy instance
- Client is simple; LB handles all discovery
DNS-based (Kubernetes Services): Kubernetes injects a DNS name for every Service (orders.default.svc.cluster.local). kube-proxy maintains iptables rules that load-balance across healthy Pods. The client just talks to the DNS name — no discovery library needed.
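A sketch of what "no discovery library" looks like in practice — plain POSIX name resolution against the Service DNS name:
/* dns_discovery.c — resolve a Kubernetes Service name and connect;
 * cluster DNS + kube-proxy pick a healthy Pod behind the ClusterIP. */
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <string.h>
#include <unistd.h>

int connect_to_service(const char *host, const char *port) {
    struct addrinfo hints, *res, *p;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0) return -1;

    int fd = -1;
    for (p = res; p != NULL; p = p->ai_next) {
        fd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (fd < 0) continue;
        if (connect(fd, p->ai_addr, p->ai_addrlen) == 0) break;
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;   /* -1 if nothing was reachable */
}

/* Usage: int fd = connect_to_service("orders.default.svc.cluster.local", "80"); */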
🔀 Correlation ID: Tracing Async Request Chains
When a request fans out across services (sync) or events (async), a correlation ID ties all logs together:
/* At API Gateway: generate if not present */
const char *corr_id = get_header(req, "X-Correlation-Id");
if (!corr_id) corr_id = generate_uuid();
set_header(req, "X-Correlation-Id", corr_id);
/* Each service: propagate to outgoing calls AND log every event */
log_info("correlation_id=%s action=order_placed order_id=%s",
corr_id, order_id);
/* Each event published to Kafka: embed correlation_id in headers */
rd_kafka_header_add(headers,
"correlation_id", strlen("correlation_id"),
corr_id, strlen(corr_id));
When debugging a production issue, search all service logs by correlation ID to reconstruct the full request timeline across service boundaries and async event chains.
⚡ Why Resilience Patterns Are Necessary
In a microservices system, any service call can fail or slow down. Without resilience patterns, one slow service causes a cascade failure: upstream services pile up blocked threads waiting for the slow service → thread pool exhaustion → entire system down.
The four primary resilience patterns:
| Pattern | Problem Solved |
|---|---|
| Timeout | Don't wait forever — bound the worst case latency |
| Retry | Transient failures (network blip) often self-resolve — retry with backoff |
| Circuit Breaker | Stop calling a failing service — give it time to recover, fail fast to callers |
| Bulkhead | Isolate resource pools — one slow service can't exhaust all threads |
🔌 Circuit Breaker — State Machine
Named after electrical circuit breakers that trip to prevent damage. Three states:
┌──────────────────────────────────────────────────────────────────┐
│ CLOSED (normal operation) │
│ Requests pass through. Count consecutive failures. │
│ failure_count >= threshold (e.g. 5 in 10s) → OPEN │
└──────────────────────────────────────────────────────────────────┘
│
failures exceed threshold
▼
┌──────────────────────────────────────────────────────────────────┐
│ OPEN (fail fast) │
│ ALL requests rejected immediately (no call to downstream). │
│ Returns cached/fallback response or error. │
│ After timeout (e.g. 30s) → HALF-OPEN │
└──────────────────────────────────────────────────────────────────┘
│
recovery timeout elapsed
▼
┌──────────────────────────────────────────────────────────────────┐
│ HALF-OPEN (probe) │
│ Allow N probe requests through. │
│ All probes succeed → CLOSED │
│ Any probe fails → OPEN (reset timer) │
└──────────────────────────────────────────────────────────────────┘
🔑 Circuit Breaker: Key Configuration Parameters
| Parameter | What It Controls | Guidance |
|---|---|---|
| failure_threshold | N failures to trip OPEN | 5–10 over a rolling window (not a lifetime total) |
| failure_rate_threshold | % failure rate to trip (more robust than a raw count) | e.g. 50% failure rate over the last 20 requests |
| open_timeout | How long to stay OPEN before probing | 30s–60s, or exponential backoff |
| half_open_max_calls | Max probe calls in HALF-OPEN | 1–3 probes; don't flood a recovering service |
| slow_call_threshold | Calls slower than N ms count as failures | Set to 2× normal p99 latency |
| fallback | What to return in OPEN state | Cached response, degraded response, or structured error |
Don't set timeouts too generously. If your circuit breaker timeout is 30s but your HTTP timeout is 60s, threads still block 30–60s before the breaker opens. Always set HTTP timeout ≤ circuit breaker slow_call_threshold.
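For example, with libcurl the HTTP timeouts can be pinned below the breaker's slow_call_threshold (the numbers here are illustrative):
/* Keep the HTTP client's worst case below what the circuit breaker assumes */
#include <curl/curl.h>

void configure_http_timeouts(CURL *handle) {
    curl_easy_setopt(handle, CURLOPT_CONNECTTIMEOUT_MS, 500L);  /* TCP/TLS setup */
    curl_easy_setopt(handle, CURLOPT_TIMEOUT_MS, 2000L);        /* entire request */
    /* With a 2s HTTP timeout, set slow_call_threshold >= 2000 ms so no
     * thread ever blocks longer than the breaker accounts for. */
}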
🚢 Bulkhead Pattern
Named after the watertight compartments in a ship — if one compartment floods, others are sealed off and the ship survives.
In microservices: instead of one shared thread pool for all downstream calls, create separate thread pools per dependency:
/* Thread pool bulkhead: separate pool per downstream service.
   POOL_SIZE and work_queue_t (a bounded work queue) are assumed defined elsewhere. */
#include <pthread.h>
typedef struct {
pthread_t threads[POOL_SIZE];
work_queue_t queue;
const char *name; /* e.g. "payment-service" */
int max_queue; /* reject if queue full */
} bulkhead_pool_t;
/* Separate pools: payment can be slow without blocking inventory calls */
bulkhead_pool_t payment_pool = { .name="payment", .max_queue=50 };
bulkhead_pool_t inventory_pool = { .name="inventory", .max_queue=200 };
If Payment Service slows down and fills the payment pool queue, the system returns 503 Service Unavailable for payment calls only. Inventory calls proceed normally — the bulkhead contains the failure.
🔁 Retry with Exponential Backoff + Jitter
Retries are essential for transient failures, but naive retries can cause thundering herd — hundreds of services all retry at the same second and overwhelm the recovering service.
Solution: exponential backoff + random jitter:
/* Retry with exponential backoff and full jitter */
#include <stdlib.h>   /* rand */
#include <unistd.h>   /* usleep */
int retry_with_backoff(int (*fn)(void*), void *ctx,
int max_attempts, int base_ms) {
for (int attempt = 0; attempt < max_attempts; attempt++) {
if (fn(ctx) == 0) return 0; /* success */
if (attempt + 1 == max_attempts) break;
/* Exponential: base_ms * 2^attempt, capped at 30s */
int cap = base_ms * (1 << attempt);
if (cap > 30000) cap = 30000;
/* Full jitter: random in [0, cap] — spreads retries */
int delay = rand() % (cap + 1);
usleep(delay * 1000);
}
return -1; /* all attempts failed */
}
/* Usage: retry up to 5 times, starting at 100ms base delay */
retry_with_backoff(call_payment_service, &ctx, 5, 100);
Only retry idempotent operations. Never blindly retry a POST that creates a resource — you'll create duplicates. Use idempotency keys (M13) to make POSTs safe to retry.
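Combining the two ideas from above — a sketch where the idempotency key is generated once, outside the retry loop, so every retry presents the same key (http_post_with_header and generate_uuid_v4 are hypothetical helpers):
/* Safe POST retries: one idempotency key for all attempts */
typedef struct {
    char idempotency_key[37];   /* generated once, reused on every retry */
    const char *body;
} post_ctx_t;

static int do_create_order(void *arg) {
    post_ctx_t *ctx = (post_ctx_t *)arg;
    /* hypothetical helper: POST with an "Idempotency-Key: <key>" header */
    return http_post_with_header("/v1/orders", "Idempotency-Key",
                                 ctx->idempotency_key, ctx->body);
}

void create_order_with_retries(void) {
    post_ctx_t ctx = { .body = "{\"item\":\"abc\",\"qty\":1}" };
    generate_uuid_v4(ctx.idempotency_key, sizeof(ctx.idempotency_key)); /* hypothetical */
    retry_with_backoff(do_create_order, &ctx, 5, 100);   /* from the snippet above */
}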
🐳 Multi-Stage Docker Build for C/C++
A C binary compiled in a full build image can run in a minimal runtime image. Multi-stage builds separate compilation from runtime, dramatically reducing image size (from ~1.2GB to ~20MB):
📄 Dockerfile — Multi-Stage C/C++ Build
# Stage 1: Builder — compile the binary
FROM gcc:13-bookworm AS builder
WORKDIR /build
# Install only build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
librdkafka-dev \
libssl-dev \
libpq-dev \
cmake \
&& rm -rf /var/lib/apt/lists/*
# Copy source
COPY . .
# Compile — statically link where possible for portable binary
RUN cmake -DCMAKE_BUILD_TYPE=Release -B build . \
&& cmake --build build --target order_service -j$(nproc)
############################################################
# Stage 2: Runtime — minimal image, just the binary
FROM debian:bookworm-slim
# Install only runtime libraries (no compilers, headers, or build tools)
RUN apt-get update && apt-get install -y --no-install-recommends \
librdkafka1 \
libssl3 \
libpq5 \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
# Security: run as non-root
RUN useradd -r -s /bin/false appuser
USER appuser
WORKDIR /app
# Copy only the compiled binary from the builder stage
COPY --from=builder /build/build/order_service /app/order_service
# ENTRYPOINT: exec form — PID 1 gets signals properly (SIGTERM for graceful shutdown)
ENTRYPOINT ["/app/order_service"]
# Default arguments (overridable at runtime)
CMD ["--port=8080"]
✅ Docker Best Practices
- Pin image versions: debian:bookworm-slim, not debian:latest — reproducible builds
- Non-root user: useradd -r + USER appuser — container escape with root = host root
- No secrets in image: use environment variables or secret mounts, never ARG PASSWORD (visible in layers)
- COPY specific files: COPY src/ /build/src/, not COPY . . — avoids copying .git and local configs
- Read-only root filesystem: --read-only flag — forces explicit volume mounts for writable paths
- Health check: HEALTHCHECK CMD curl -f http://localhost:8080/health/live || exit 1
🚫 Common Docker Mistakes
- Running as root (default if no USER is set)
- Using the latest tag — non-deterministic; breaks reproducibility
- Building in a single stage — the final image carries GCC, headers, build tools
- Putting secrets in environment variables that get logged
- Using shell form instead of exec form — the shell becomes PID 1, so docker stop's SIGTERM never reaches your process
- apt-get update without && apt-get install in the same RUN — stale layer cache
- Not adding a .dockerignore — copies node_modules/, .git/, build artifacts
📄 .dockerignore
# .dockerignore — keep build context small and clean
.git
.gitignore
.github
build/
*.o
*.a
*.so
cmake-build-debug/
CMakeCache.txt
CMakeFiles/
.env
.env.*
*.md
docs/
tests/
Every byte in the build context is sent to the Docker daemon. Large build contexts (accidental .git inclusion) slow down every build. A good .dockerignore is as important as the Dockerfile itself.
🔄 ENTRYPOINT vs CMD: Graceful Shutdown
When Kubernetes sends SIGTERM (graceful shutdown), the signal goes to PID 1 in the container. If your process is not PID 1, it never receives SIGTERM and gets hard-killed after
terminationGracePeriodSeconds.
| Form | Shell | PID 1 | Gets SIGTERM? |
|---|---|---|---|
| ENTRYPOINT ["/app/service"] (exec) | No | Your binary | ✅ Yes |
| ENTRYPOINT /app/service (shell) | /bin/sh -c | sh | ❌ No (sh is PID 1) |
| CMD ["/app/service"] (exec) | No | Your binary | ✅ Yes (if no ENTRYPOINT) |
Always use exec form: ENTRYPOINT ["/app/service"]. In your C process, register a SIGTERM handler that drains connections and exits cleanly.
☸️ Kubernetes Core Objects
| Object | Purpose | Key Fields |
|---|---|---|
| Pod | Smallest deployable unit: one or more containers sharing network/storage | spec.containers[].image, resources, env |
| Deployment | Declares desired state: N replicas of a Pod template; manages rolling updates and rollback | spec.replicas, spec.strategy, spec.template |
| Service | Stable DNS name + ClusterIP that load-balances across matching Pods (by label selector) | spec.selector, spec.ports, spec.type |
| Ingress | HTTP/S routing rules: hostname/path → Service; TLS termination | spec.rules[].host, spec.tls |
| ConfigMap | Non-sensitive configuration: mounted as env vars or files | data key-value pairs |
| Secret | Sensitive data (passwords, tokens): base64-encoded, encrypted at rest | data (base64), type |
| HPA | Horizontal Pod Autoscaler: scales replicas based on CPU/memory/custom metrics | spec.minReplicas, spec.maxReplicas, spec.metrics |
📄 Kubernetes Deployment — C Service Example
# order-service Deployment + Service
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-service
labels:
app: order-service
spec:
replicas: 3
selector:
matchLabels:
app: order-service
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1 # at most 1 Pod down during update
maxSurge: 1 # at most 1 extra Pod during update
template:
metadata:
labels:
app: order-service
spec:
containers:
- name: order-service
image: registry.example.com/order-service:1.4.2
ports:
- containerPort: 8080
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: order-service-secrets
key: database_url
- name: KAFKA_BROKERS
valueFrom:
configMapKeyRef:
name: order-service-config
key: kafka_brokers
resources:
requests:
cpu: "100m" # 0.1 CPU cores guaranteed
memory: "64Mi"
limits:
cpu: "500m" # burst up to 0.5 CPU
memory: "256Mi" # OOM-killed if exceeded
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
failureThreshold: 3 # restart after 3 consecutive failures
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 3
periodSeconds: 5
failureThreshold: 2 # remove from LB after 2 failures
terminationGracePeriodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
name: order-service
spec:
selector:
app: order-service
ports:
- port: 80
targetPort: 8080
💓 Liveness vs Readiness Probes
| Probe | Failure Action | Purpose |
|---|---|---|
| Liveness | Restart the container | Is the process alive? (Detects deadlocks, infinite loops) |
| Readiness | Remove from Service endpoints (stops traffic) | Is the process ready to serve? (DB connected, cache warm) |
| Startup | Restart if not ready within window | Slow-starting apps — disables liveness until startup complete |
Never make readiness probe depend on external services. If Payment Service is down, you don't want all Order Service pods removed from load balancing — implement graceful degradation instead.
🚀 Rolling Updates & Rollback
- Update image: kubectl set image deployment/order-service order-service=registry.example.com/order-service:1.4.3
- Monitor rollout: kubectl rollout status deployment/order-service
- Rollback to previous: kubectl rollout undo deployment/order-service
- Rollback to specific revision: kubectl rollout undo deployment/order-service --to-revision=2
Pod Disruption Budget: minAvailable: 2 — the cluster autoscaler and rolling updates respect this; they never take down so many pods that fewer than 2 remain available.
🔄 CI/CD Pipeline Stages
A complete pipeline runs on every commit and gates production deployment behind automated quality checks:
1. Lint & Static Analysis — clang-tidy, cppcheck, clang-format check. Fail fast: bad code never reaches tests. (~30s)
2. Unit Tests — fast, isolated tests with mocked dependencies. Target: >80% coverage on core business logic. (~2m)
3. Integration Tests — spin up Postgres, Kafka, Redis via Docker Compose; test real service behavior against real dependencies. (~5m)
4. Security Scan — Trivy scans for CVEs in the base image and dependencies; Semgrep for security antipatterns in code. Block on HIGH/CRITICAL CVEs.
5. Build OCI Image — multi-stage Docker build. Tag with the git SHA: registry/service:abc1234. SHA tags are immutable — never use :latest in production.
6. Push to Registry — push to the container registry. Sign the image with cosign for supply chain security.
7. Deploy to Staging — kubectl set image or Helm upgrade. Run smoke tests against the staging URL.
8. Deploy to Production — manual approval gate (or auto on green staging). Blue-green or canary rollout. Monitor error rate + latency for 10 minutes.
🟢🔵 Blue-Green Deployment
Maintain two identical environments (blue = current, green = new):
- Deploy new version to green environment
- Run smoke tests on green (not receiving production traffic)
- Switch load balancer to point to green (instant cutover)
- Blue environment kept running for instant rollback
- After confidence period, decommission blue
Cons: requires 2× infrastructure during transition
🐦 Canary Deployment
Route a small percentage of traffic to new version first:
- Deploy new version alongside old; route 5% of traffic to it
- Monitor error rate, latency, business metrics (conversion rate)
- If healthy after 10m: increase to 20% → 50% → 100%
- If issues: instant rollback by routing 100% back to old version
Cons: two versions run simultaneously — must be API-compatible
📄 GitHub Actions Example — C Service CI Pipeline
# .github/workflows/ci.yml
name: CI
on:
push:
branches: [main]
pull_request:
jobs:
build-test:
runs-on: ubuntu-latest
services:
postgres:
image: postgres:16
env:
POSTGRES_PASSWORD: test
POSTGRES_DB: testdb
options: >-
--health-cmd pg_isready
--health-interval 10s
steps:
- uses: actions/checkout@v4
- name: Install dependencies
run: |
sudo apt-get update
sudo apt-get install -y libpq-dev librdkafka-dev clang-tidy cppcheck
- name: Configure
run: cmake -B build -DCMAKE_BUILD_TYPE=Debug -DENABLE_TESTS=ON
- name: Build
run: cmake --build build -j$(nproc)
- name: Lint
run: clang-tidy src/*.c -- -Iinclude
- name: Unit tests
run: ./build/tests/unit_tests
- name: Integration tests
env:
DATABASE_URL: postgres://postgres:test@localhost/testdb
run: ./build/tests/integration_tests
docker-build:
needs: build-test
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/main'
steps:
- uses: actions/checkout@v4
- name: Build and push image
run: |
docker build -t registry.example.com/order-service:${{ github.sha }} .
docker push registry.example.com/order-service:${{ github.sha }}
- name: Deploy to staging
run: |
kubectl set image deployment/order-service \
order-service=registry.example.com/order-service:${{ github.sha }}
kubectl rollout status deployment/order-service --timeout=5m
📋 The 12-Factor App Methodology
A methodology for building software-as-a-service apps that are portable, scalable, and maintainable. Originally from Heroku — now the standard for cloud-native services.
🔑 The 12 Factors (Microservices-Relevant Highlights)
| # | Factor | Rule | C Implementation |
|---|---|---|---|
| 1 | Codebase | One codebase per service, tracked in version control | One git repo per service; main deploys to production |
| 2 | Dependencies | Declare and isolate all dependencies explicitly | CMakeLists.txt pins exact library versions; no implicit system libraries |
| 3 | Config | Config in environment, not in code | getenv("DATABASE_URL") — never hardcode DSN/passwords |
| 4 | Backing Services | Treat DB, cache, broker as attached resources | URL from env — swap Postgres for RDS without code change |
| 6 | Processes | Execute app as one or more stateless processes | No in-process session state; sessions in Redis |
| 7 | Port Binding | Export services via port binding, not app server injection | Service binds $PORT itself; Kubernetes routes to it |
| 8 | Concurrency | Scale out via the process model | Multiple replicas (K8s replicas: N), not threads-per-monolith |
| 9 | Disposability | Maximize robustness with fast startup and graceful shutdown | Handle SIGTERM: drain connections, flush buffers, exit 0 |
| 11 | Logs | Treat logs as event streams — write to stdout | fprintf(stdout, "...") — never write to files inside container |
| 12 | Admin Processes | Run admin/management tasks as one-off processes | DB migrations as a separate Job (Kubernetes Job), not in service startup |
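Factor 7 in C is small enough to show inline — the service reads its own port from the environment and binds it itself (a sketch; PORT is the conventional variable name):
/* Factor 7: the process binds the port the platform hands it */
#include <stdlib.h>

int get_listen_port(void) {
    const char *p = getenv("PORT");
    return p ? atoi(p) : 8080;   /* sensible default for local dev */
}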
⚙️ Factor 3: Config via Environment
/* Never do this: */
const char *db_url = "postgres://prod-db:5432/app";
/* Do this: */
const char *db_url = getenv("DATABASE_URL");
if (!db_url) {
fprintf(stderr, "DATABASE_URL not set\n");
exit(1);
}
Config that varies between environments (dev/staging/prod) must never be in code. The same binary runs in all environments — only the environment variables differ.
🔄 Factor 9: Graceful Shutdown in C
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t shutting_down = 0;
static void handle_sigterm(int sig) {
(void)sig;
shutting_down = 1;
}
int main() {
signal(SIGTERM, handle_sigterm);
signal(SIGINT, handle_sigterm);
while (!shutting_down) {
/* serve requests */
}
/* Graceful shutdown: drain connections */
drain_active_connections();
rd_kafka_flush(rk, 10000); /* flush pending events */
PQfinish(pg_conn);
fprintf(stdout, "Shutdown complete\n");
return 0;
}
📊 Factor 11: Structured Logs to stdout
Write logs as structured JSON to stdout. The container runtime captures stdout and forwards to your log aggregation platform (ELK, Loki, Datadog).
/* Structured JSON logging */
#define LOG_INFO(fmt, ...) \
fprintf(stdout, \
"{\"level\":\"INFO\",\"ts\":\"%.3f\",\"msg\":\"" fmt "\"}\n", \
get_unix_ms(), ##__VA_ARGS__)
/* Usage */
LOG_INFO("order_placed order_id=%s user_id=%s amount=%.2f",
order_id, user_id, amount);
Include in every log line: timestamp, level, service, correlation_id, and the event. This makes logs searchable and correlatable across services in your log aggregator.
── Implementation 1 — Circuit Breaker (Thread-Safe, C11 Atomics) ──
🔌 Circuit Breaker in C (stdatomic, three-state machine)
/* circuit_breaker.h — thread-safe circuit breaker */
#pragma once
#include <stdatomic.h>
#include <time.h>
#include <stdbool.h>
#include <stdio.h>   /* fprintf: log breaker state changes */
typedef enum { CB_CLOSED, CB_OPEN, CB_HALF_OPEN } cb_state_t;
typedef struct {
_Atomic(int) state; /* cb_state_t */
_Atomic(int) failure_count;
_Atomic(long) open_since_ms; /* monotonic ms when opened */
int failure_threshold;
long open_timeout_ms;
} circuit_breaker_t;
static inline long now_ms(void) {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return ts.tv_sec * 1000LL + ts.tv_nsec / 1000000LL;
}
static inline void cb_init(circuit_breaker_t *cb,
int threshold, long timeout_ms) {
atomic_store(&cb->state, CB_CLOSED);
atomic_store(&cb->failure_count, 0);
atomic_store(&cb->open_since_ms, 0);
cb->failure_threshold = threshold;
cb->open_timeout_ms = timeout_ms;
}
/* Returns true if the call should be allowed through */
static inline bool cb_allow(circuit_breaker_t *cb) {
int state = atomic_load(&cb->state);
if (state == CB_CLOSED) return true;
if (state == CB_OPEN) {
long elapsed = now_ms() - atomic_load(&cb->open_since_ms);
if (elapsed >= cb->open_timeout_ms) {
/* Transition to HALF_OPEN to probe recovery */
int expected = CB_OPEN;
if (atomic_compare_exchange_strong(&cb->state,
&expected, CB_HALF_OPEN)) {
return true; /* this thread gets the probe request */
}
}
return false; /* still open */
}
/* HALF_OPEN: let probe requests through (a production breaker would
cap concurrent probes via half_open_max_calls) */
return true;
}
static inline void cb_on_success(circuit_breaker_t *cb) {
int state = atomic_load(&cb->state);
if (state == CB_HALF_OPEN) {
/* Recovery confirmed: close the breaker */
atomic_store(&cb->failure_count, 0);
atomic_store(&cb->state, CB_CLOSED);
}
if (state == CB_CLOSED) {
/* Reset failure count on success */
atomic_store(&cb->failure_count, 0);
}
}
static inline void cb_on_failure(circuit_breaker_t *cb) {
int state = atomic_load(&cb->state);
if (state == CB_HALF_OPEN) {
/* Probe failed: reopen the breaker */
atomic_store(&cb->open_since_ms, now_ms());
atomic_store(&cb->state, CB_OPEN);
return;
}
int count = atomic_fetch_add(&cb->failure_count, 1) + 1;
if (count >= cb->failure_threshold) {
int expected = CB_CLOSED;
if (atomic_compare_exchange_strong(&cb->state, &expected, CB_OPEN)) {
atomic_store(&cb->open_since_ms, now_ms());
fprintf(stderr, "[CB] Circuit OPENED after %d failures\n", count);
}
}
}
/* Usage */
int call_payment_service(void *ctx) { return 0; } /* placeholder */
circuit_breaker_t payment_cb;
int charge_customer(const char *order_id, double amount) {
if (!cb_allow(&payment_cb)) {
fprintf(stderr, "[CB] OPEN: payment service unavailable\n");
return -1; /* fail fast */
}
int result = call_payment_service(NULL);
if (result == 0)
cb_on_success(&payment_cb);
else
cb_on_failure(&payment_cb);
return result;
}
── Implementation 2 — Health Check HTTP Server ──
💓 Minimal Health Check HTTP Server (POSIX sockets)
/* health.c — minimal HTTP health check endpoint for Kubernetes probes */
#include <sys/socket.h>
#include <netinet/in.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
static volatile int ready = 0; /* set to 1 once DB connected etc. */
static void *health_thread(void *arg) {
(void)arg;
int srv = socket(AF_INET, SOCK_STREAM, 0);
int opt = 1;
setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));
struct sockaddr_in addr = {
.sin_family = AF_INET,
.sin_addr.s_addr = INADDR_ANY,
.sin_port = htons(8081)
};
bind(srv, (struct sockaddr *)&addr, sizeof(addr));
listen(srv, 16);
while (1) {
int conn = accept(srv, NULL, NULL);
if (conn < 0) continue;
char buf[256];
ssize_t n = recv(conn, buf, sizeof(buf) - 1, 0);
buf[n > 0 ? n : 0] = '\0';
const char *resp;
if (strstr(buf, "GET /health/live")) {
resp = "HTTP/1.1 200 OK\r\nContent-Length:2\r\n\r\nOK";
} else if (strstr(buf, "GET /health/ready")) {
resp = ready
? "HTTP/1.1 200 OK\r\nContent-Length:5\r\n\r\nREADY"
: "HTTP/1.1 503 Service Unavailable\r\nContent-Length:12\r\n\r\nNOT_READY_YET";
} else {
resp = "HTTP/1.1 404 Not Found\r\nContent-Length:0\r\n\r\n";
}
send(conn, resp, strlen(resp), 0);
close(conn);
}
return NULL;
}
void start_health_server(void) {
pthread_t t;
pthread_create(&t, NULL, health_thread, NULL);
pthread_detach(t);
}
void set_ready(int r) { ready = r; }
Start the health server before connecting to databases so the /health/live probe succeeds immediately. Set ready=1 only after all dependencies (DB, Kafka) are connected — this keeps the pod out of the Service load balancer until it's actually ready.
🔬 Lab 1 — Circuit Breaker Under Load
Observe circuit breaker state transitions under real failure conditions.
1 Build the circuit breaker from Tab 8. Write a test harness that calls a "downstream service" function that returns success/failure based on a configurable failure rate.
2 Run 100 concurrent threads (pthreads) calling the circuit breaker simultaneously. Set the failure rate to 80%.
3 Observe and log state transitions: CLOSED → OPEN (trip after N failures) → HALF-OPEN (after timeout) → CLOSED (after probe success).
4 Measure: in OPEN state, what is the p99 response time? (Should be microseconds — fail fast.) Compare to CLOSED state with real downstream calls.
5 Bonus: add a sliding window failure rate threshold (failure rate over last 20 calls, not just a count) and verify it's more resilient to bursty failures.
🔬 Lab 2 — Docker Multi-Stage Build: Size Comparison
Demonstrate the image size impact of multi-stage builds.
1 Write a simple C HTTP server (or use the health server from Tab 8). Compile it manually to confirm it works.
2 Write a single-stage Dockerfile using FROM gcc:13. Build it: docker build -t service:single-stage . — check its size with docker image ls service:single-stage.
3 Write the multi-stage Dockerfile from Tab 4. Build it: docker build -t service:multi-stage . — compare the sizes.
4 Run docker run --rm service:multi-stage. Verify the binary executes correctly in the slim image.
5 Run docker history service:multi-stage — verify no build tools (gcc, make) appear in any layer of the final image.
6 Run docker run --user $(id -u) service:multi-stage — verify non-root execution. Check the process inside the container: docker exec <id> id.
🔬 Lab 3 — Kubernetes Deployment with Probes
Deploy the C service to a local Kubernetes cluster (minikube or kind).
1 Build the Docker image and load it into the local cluster: minikube image load service:multi-stage.
2 Apply the Deployment from Tab 5. Watch pods come up: kubectl get pods -w.
3 Observe the readiness probe in action: modify the service to delay setting ready=1 by 10 seconds. Watch the pod stay NotReady during startup.
4 Trigger a liveness probe failure: modify the /health/live endpoint to return 503 after receiving 5 requests. Observe Kubernetes restart the pod.
5 Perform a rolling update: rebuild with a different version tag and apply the new image. Watch the rollout: kubectl rollout status deployment/order-service.
6 Rollback: kubectl rollout undo deployment/order-service. Verify the previous image is running.
🔬 Lab 4 — Strangler Fig Migration (Simulated)
Simulate extracting a service from a monolith using Strangler Fig.
1 Write a "monolith": a C HTTP server handling
/orders/*, /users/*, and /notifications/* all in one process.2 Write a new "Notifications microservice": a separate C process handling
/notifications/*.3 Add an Nginx reverse proxy as the "API Gateway": route
/notifications/* to the new service, all other paths to the monolith.4 Verify: requests to
/orders/123 hit the monolith. Requests to /notifications/send hit the new service. Both return correct responses.5 Remove the notifications handler from the monolith. Verify all notification requests still work (now served entirely by new service).
── Phase 6 Mastery Checklist ──
Architecture
- Explain Conway's Law and how it drives service boundaries
- Describe the Strangler Fig migration pattern step by step
- Compare sync REST/gRPC vs async events for inter-service communication
- List 5 responsibilities of an API Gateway
- Explain client-side vs server-side service discovery
- Draw the circuit breaker state machine (CLOSED/OPEN/HALF-OPEN)
- Implement retry with exponential backoff + jitter
- Explain the bulkhead pattern and when to apply it
- Set correct timeout values relative to circuit breaker thresholds
Docker & Kubernetes
- Write a multi-stage Dockerfile for a C binary
- Explain why non-root + exec-form ENTRYPOINT matters
- Write a Deployment with liveness + readiness probes
- Explain the difference between liveness and readiness probes
- Perform a rolling update and rollback with kubectl
- List the 8 stages of a production CI/CD pipeline
- Explain blue-green vs canary deployment trade-offs
- Apply 12-Factor principles: config in env, logs to stdout, graceful shutdown
- Write structured JSON logging and graceful SIGTERM handling in C