M17 — Observability & Hardening
Phase 7
3 pillars: logs · metrics · traces · Prometheus & PromQL · Distributed tracing & OpenTelemetry · Rate limiting · OWASP Top 10 · Secrets management · Graceful shutdown
🔭 The 3 Pillars of Observability
Observability is the ability to understand the internal state of a system from its external outputs. The three pillars each answer a different question:
| Pillar | Question Answered | Data Type | Tools |
|---|---|---|---|
| Logs | "What happened, exactly?" | Discrete events with context | ELK, Loki, Fluentd |
| Metrics | "How fast / how many / how full?" | Aggregated numeric time-series | Prometheus, Grafana, Datadog |
| Traces | "Why was this request slow?" | Causal chains across services | Jaeger, Tempo, Zipkin, Honeycomb |
The three pillars are complementary, not interchangeable. An alert fires on a metric (high p99 latency). You look at a trace to find the slow span. You look at logs from that span to see the exact error. Use all three together.
Analogy — The flight data recorder:
Logs are the cockpit voice recorder — full narrative of what was said. Metrics are the flight data recorder — altitude, speed, attitude plotted over time. Traces are the air traffic control replay — the full path of the aircraft from departure to destination. An accident investigation uses all three.
📐 Observability vs Monitoring
| | Monitoring | Observability |
|---|---|---|
| Approach | Predefined thresholds and dashboards for known failure modes | Ability to ask arbitrary questions about system behavior |
| Limits | Only catches failures you anticipated and built alerts for | Enables debugging novel, unknown failure modes |
| Data | Aggregated metrics, simple health checks | Logs + metrics + traces with high cardinality |
| Tooling | Nagios, simple dashboards | OpenTelemetry, Honeycomb, Grafana + Loki + Tempo |
Start with monitoring (dashboards for known metrics, alerts on thresholds). Add observability as system complexity grows — when you start debugging failures you didn't anticipate.
📐 Phase 7 Module Map
| Module | Topic | Key Concepts |
|---|---|---|
| M17 (this) | Observability & Hardening | Logs, metrics, traces, alerting, SLO, security, rate limiting |
| M18 | Performance Engineering | Profiling, flame graphs, memory analysis, benchmark methodology |
Prerequisites: Phase 6 (Microservices — you need services to observe; the health probes from M15 are the basis of the readiness checks here)
📝 Structured Logging: JSON Lines Format
Write one JSON object per line to stdout. Every log line must include mandatory fields for searchability:
/* Good: structured JSON log */
{"ts":"2026-03-27T14:23:01.442Z","level":"INFO","service":"order-svc",
"trace_id":"4bf92f3577b34da6","span_id":"00f067aa0ba902b7",
"msg":"order placed","order_id":"ord-9821","user_id":"u-44","amount_usd":49.99}
/* Bad: unstructured text — unsearchable */
[2026-03-27 14:23:01] INFO: Order ord-9821 placed by user u-44 for $49.99
📋 Mandatory Log Fields
| Field | Format | Purpose |
|---|---|---|
| ts | ISO-8601 with ms | Timeline reconstruction |
| level | DEBUG/INFO/WARN/ERROR/FATAL | Log level filtering |
| service | service name | Multi-service log aggregation |
| trace_id | hex string | Correlate with traces |
| span_id | hex string | Correlate with specific span |
| msg | human-readable | Event description |
🎚️ Log Levels — When to Use Each
| Level | Use When |
|---|---|
| DEBUG | Verbose detail for local dev only. Never in production — log volume explosion. |
| INFO | Normal business events: request received, order placed, job started. |
| WARN | Degraded but recoverable: retry succeeded, cache miss, approaching limit. |
| ERROR | Unexpected failure requiring attention: DB timeout, invalid state, downstream error. |
| FATAL | Unrecoverable — process will exit after logging. |
🚫 What NOT to Log
| Never Log | Why | Alternative |
|---|---|---|
| Passwords, API keys, tokens | Log aggregators, retention, and breach exposure | Log presence/absence, not value |
| Full credit card numbers | PCI-DSS violation | Log last 4 digits only |
| PII (emails, SSN, full name) | GDPR/CCPA violation | Log user_id (opaque reference) |
| Full request/response bodies | Volume, PII risk | Log status codes and latency only |
| Health probe hits (/health/*) | Thousands/min of noise | Filter at log aggregator |
Log injection: never embed user-supplied strings directly in log messages without sanitization. A user who sets their name to ","level":"ERROR","msg":"admin escalation can forge log entries. Escape or use parameterized logging.
🔗 Correlation: Linking Logs Across Services
The trace_id links all log lines for a single request across every service it touches:
/* API Gateway generates trace_id */
GET /orders/123 → trace_id: 4bf92f3577b34da6
/* Order Service log */
{"service":"order-svc","trace_id":"4bf92f3577b34da6","msg":"fetching order"}
/* Database call log */
{"service":"order-svc","trace_id":"4bf92f3577b34da6","msg":"db query","duration_ms":12}
/* In Loki/Elasticsearch: search by trace_id to see full request timeline */
{trace_id="4bf92f3577b34da6"} | json | sort by ts
In Grafana with Loki + Tempo integration: click a trace in Tempo → jump directly to correlated logs in Loki for that trace_id. This cross-pillar navigation is the power of consistent trace_id propagation.
📊 Metric Types
| Type | Properties | Example | PromQL Usage |
|---|---|---|---|
| Counter | Monotonically increasing, never decreases, resets to 0 on restart | http_requests_total{method="GET",status="200"} | rate(http_requests_total[5m]) → requests/sec |
| Gauge | Point-in-time value, can go up or down | active_connections, memory_usage_bytes, queue_depth | Direct: active_connections > 1000 |
| Histogram | Samples bucketed by value; provides _count, _sum, _bucket | request_duration_seconds{le="0.1"} | histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m])) |
| Summary | Pre-computed quantiles on client side (less flexible) | request_duration_seconds{quantile="0.99"} | Direct quantile access; can't re-aggregate across instances |
Prefer Histogram over Summary. Histograms can be aggregated across multiple instances (e.g., p99 across all pods). Summaries compute quantiles per-process and can't be aggregated.
🔴 RED Method (Services)
The minimal set of metrics for any service:
- Rate: requests per second — is traffic normal? rate(http_requests_total[5m])
- Errors: error rate — are users experiencing failures? rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
- Duration: latency percentiles — is it slow? histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))
Alert on all three. High error rate → immediate page. High p99 → investigate. Low rate → traffic drop (upstream issue or deploy broke routing).
📊 USE Method (Resources)
For infrastructure resources (CPU, memory, disk, network):
- Utilization: % of time the resource is busy — rate(process_cpu_seconds_total[1m]) * 100
- Saturation: extra work queued (resource can't keep up) — node_load1 / count(node_cpu_seconds_total{mode="idle"}) by (instance)
- Errors: error count or rate for the resource — node_disk_io_time_weighted_seconds_total
High utilization alone isn't a problem. High utilization + high saturation = at capacity. Alert when saturation exceeds 0 (work is queued).
📡 Prometheus: Pull-Based Scraping & Exposition Format
Prometheus polls your service's /metrics endpoint on a configurable interval (typically 15s). Your service exposes metrics in the Prometheus text format:
# HELP http_requests_total Total HTTP requests by method and status code
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 14823
http_requests_total{method="POST",status="201"} 3291
http_requests_total{method="GET",status="500"} 42
# HELP request_duration_seconds Request latency histogram
# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{le="0.01"} 8901
request_duration_seconds_bucket{le="0.05"} 13102
request_duration_seconds_bucket{le="0.1"} 14651
request_duration_seconds_bucket{le="0.5"} 14820
request_duration_seconds_bucket{le="+Inf"} 14823
request_duration_seconds_sum 891.23
request_duration_seconds_count 14823
# HELP active_connections Current active connections
# TYPE active_connections gauge
active_connections 47
Labels are a high-cardinality risk. A label like {user_id="..."} creates one time-series per user — millions of time-series destroy Prometheus. Use low-cardinality labels: method, status, endpoint (grouped). Never use user IDs, trace IDs, or UUIDs as labels.
🔍 Essential PromQL Queries
| What You Want | PromQL Query |
|---|---|
| Request rate (req/sec) | rate(http_requests_total[5m]) |
| Error rate % | rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 |
| p99 latency | histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket[5m])) by (le)) |
| CPU usage % | 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) |
| Memory used | node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes |
| Kafka consumer lag | kafka_consumer_group_lag{topic="orders",partition="0"} |
| DB connection pool saturation | pg_stat_activity_count / pg_settings_max_connections |
🔗 Distributed Tracing: Trace & Span Model
A trace represents the complete journey of a single request through the system — from the client's HTTP request through every service, database call, and queue publish it touches.
A span represents a single unit of work within a trace (one service call, one DB query). Each span has:
- trace_id — shared across all spans in the same request
- span_id — unique to this operation
- parent_span_id — the span that triggered this one (null for the root span)
- Start time + duration
- Attributes (key-value context)
- Status (OK / ERROR)
Trace: trace_id=4bf92f3577b34da6
span_id=00f067aa [API Gateway] GET /orders/123 0ms ──────────────── 87ms
│
├─ span_id=a3ce929d [Order Service] handle_request 2ms ──────── 83ms
│ │
│ ├─ span_id=5e0c22e9 [DB: SELECT orders] 4ms ── 16ms (12ms)
│ │
│ ├─ span_id=7f3d8a1b [Redis: GET cache] 18ms ─ 19ms (1ms)
│ │
│ └─ span_id=2d4f991c [Kafka: publish event] 20ms ──── 34ms (14ms)
│
└─ span_id=b1c5e072 [Auth Service] verify_token 1ms ─ 2ms (1ms)
Flame chart: wider = longer. The DB SELECT at 12ms is the hot spot.
📡 W3C traceparent Header
The standard header for propagating trace context across HTTP service calls:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                │
             │  └─ trace_id (128-bit, 32 hex)    │                └─ flags: 01 = sampled
             └─ version (00)                     └─ parent_span_id          00 = not sampled
                                                    (64-bit, 16 hex)
Each service: read traceparent from incoming request, create a child span (using the span_id as parent_span_id), set the new span's span_id, and propagate the updated traceparent in outgoing calls.
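A minimal sketch of that propagation step in C, assuming version 00 and well-formed lowercase input. make_span_id() is a hypothetical helper for generating a random 64-bit hex span id, and log_set_trace() is the hook from the structured logger in Implementation 2:
/* Parse incoming traceparent, start a child span, build the outgoing header */
#include <stdio.h>
#include <string.h>
void make_span_id(char hex[17]); /* hypothetical: fills 16 hex chars + NUL */
int propagate_traceparent(const char *incoming, char *outgoing, size_t out_sz) {
    char trace_id[33], parent_span[17], child_span[17];
    /* layout: 00-<32 hex trace_id>-<16 hex span_id>-<2 hex flags> */
    if (sscanf(incoming, "00-%32[0-9a-f]-%16[0-9a-f]",
               trace_id, parent_span) != 2)
        return -1; /* malformed or unsupported version */
    make_span_id(child_span);            /* this service's new span_id */
    log_set_trace(trace_id, child_span); /* from logger.h (Implementation 2) */
    /* same trace_id; our span becomes the parent for downstream calls */
    snprintf(outgoing, out_sz, "00-%s-%s-01", trace_id, child_span);
    return 0;
}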
📦 OpenTelemetry (OTel)
OpenTelemetry is the CNCF standard for instrumentation — vendor-neutral SDK for generating traces, metrics, and logs.
- SDK: available in C++, Go, Java, Python, and more
- OTLP: OpenTelemetry Protocol — exports to any backend
- OTel Collector: receives OTLP, processes (batch, sample), exports to Jaeger/Tempo/Datadog
- Auto-instrumentation: inject tracing without changing application code (Java agent, eBPF)
- Manual instrumentation: create custom spans for business logic
📊 Sampling Strategies
At high volume (10,000 req/sec), storing every trace is expensive. Sampling decides which traces to keep:
| Strategy | How | Trade-off |
|---|---|---|
| Head-based: Always-on | Keep 100% of traces | Very expensive at scale |
| Head-based: Probability | Keep N% (e.g. 1%) | Simple, but misses rare errors |
| Head-based: Rate-limit | Keep up to N traces/sec | Bounded cost; may drop bursts |
| Tail-based: Error sampling | Buffer all traces; keep if trace has an error span | Catches errors; high memory buffer |
| Tail-based: Latency threshold | Keep if trace duration > P99 threshold | Catches slowness; complex to implement |
Production recommendation: 1% head-based sampling for normal traffic + 100% sampling for traces with errors (tail-based error sampling). This keeps costs bounded while capturing all failure evidence.
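As a sketch of the head-based part, the keep/drop decision can be derived from the trace_id itself, so every service samples the same traces without coordination (the 1% ratio and 64-bit truncation are illustrative):
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
/* Head-based probability sampling keyed on trace_id: deterministic,
   so all services in the request path agree on keep vs drop. */
static bool should_sample(const char *trace_id_hex /* 32 hex chars */) {
    uint64_t low = strtoull(trace_id_hex + 16, NULL, 16); /* last 64 bits */
    return low % 100 == 0; /* keep ~1% of traces */
}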
🔔 Alerting with Prometheus AlertManager
Prometheus evaluates alerting rules against scraped metrics. When conditions are met, it sends alerts to AlertManager, which routes them to PagerDuty, Slack, email, etc.
# alerting_rules.yml
groups:
- name: order-service
rules:
- alert: HighErrorRate
expr: |
rate(http_requests_total{service="order-svc",status=~"5.."}[5m])
/ rate(http_requests_total{service="order-svc"}[5m]) > 0.01
for: 2m # must be true for 2m before firing (avoid flapping)
labels:
severity: critical
annotations:
summary: "Error rate > 1% for order-service"
runbook_url: "https://wiki/runbooks/order-svc-errors"
- alert: HighP99Latency
expr: |
histogram_quantile(0.99,
sum(rate(request_duration_seconds_bucket{service="order-svc"}[5m]))
by (le)) > 2.0
for: 5m
labels:
severity: warning
- alert: ServiceDown
expr: up{service="order-svc"} == 0
for: 1m
labels:
severity: critical
📏 SLI, SLO, SLA
- SLI (Service Level Indicator): a specific measurable metric — e.g., availability = (successful requests) / (total requests)
- SLO (Service Level Objective): internal target — e.g., availability ≥ 99.9% over 30 days. Engineering commits to this.
- SLA (Service Level Agreement): external contract with customers, carrying legal/financial penalties for violations. The internal SLO should be tighter than the SLA as a safety buffer.
- Error Budget: the SLO headroom — a 99.9% SLO leaves a 0.1% budget ≈ 43.2 min per 30 days. Track the burn rate.
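The budget arithmetic, as a quick check: 30 days × 24 h × 60 min = 43,200 min; (1 − 0.999) × 43,200 min = 43.2 min of allowed downtime per 30-day window (or the equivalent fraction of failed requests).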
💸 Error Budget & Burn Rate Alerts
Don't alert "SLO violated" (too late). Alert on burn rate — how fast you're consuming the error budget:
- Fast burn: consuming the budget at 14× the sustainable rate (measured over a 1h window) → the 30-day budget is gone in ~2 days → page immediately
- Slow burn: consuming at 3× the sustainable rate → gone in ~10 days → ticket
# Fast burn alert: >14x budget consumption over 1h
expr: |
(1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
/ sum(rate(http_requests_total[1h]))))
/ (1 - 0.999) > 14
📋 Alert Design Principles
- Alert on symptoms, not causes: alert on high error rate (a symptom users feel), not on "CPU at 80%" (a cause that may not impact users)
- Every alert needs a runbook: include a runbook_url annotation. On-call engineers should never face an alert without documented response steps.
- Avoid alert fatigue: if the same alert fires weekly and engineers silence it, it's not actionable. Remove or fix it.
- Use a for duration: require the condition to be sustained before paging (avoids flapping on 1-second spikes)
- Group related alerts: AlertManager can group 50 firing alerts into one notification — prevents notification flood during outages
🔒 OWASP Top 10 for Backend Services
| Vulnerability | C/Backend Example | Prevention |
|---|---|---|
| SQL Injection | "SELECT * FROM users WHERE id=" + user_id | Parameterized queries only: PQexecParams(conn, "SELECT... WHERE id=$1", 1, NULL, params, ...) |
| Command Injection | system("ls " + user_input) | Never use system() with user input. Use execv() with argument array. |
| SSRF | Service fetches URL from user request body; attacker uses http://169.254.169.254/ (AWS metadata) | Allowlist of permitted outbound domains; block RFC-1918 and link-local addresses |
| Broken Access Control | User A can read User B's orders by changing order_id in request | Check authorization on every resource: WHERE id=$1 AND user_id=$2 |
| Security Misconfiguration | Debug endpoints enabled in prod, default credentials, verbose error messages | Separate prod config; disable /debug endpoints; return generic errors |
| Insecure Deserialization | Deserializing untrusted binary input (msgpack, protobuf from user) | Validate schema; set max sizes; reject unknown fields |
| Cryptographic Failures | MD5 for passwords, ECB mode, hardcoded keys | bcrypt/Argon2 for passwords; AES-256-GCM for encryption; libsodium |
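A minimal libpq sketch of the parameterized-query fix from the table above (connection setup and error handling omitted; the table and column names are illustrative):
#include <libpq-fe.h>
/* user_input is bound as a parameter, never spliced into the SQL text,
   so an input like '; DROP TABLE orders; -- is just a string value */
PGresult *fetch_order(PGconn *conn, const char *user_input) {
    const char *params[1] = { user_input };
    return PQexecParams(conn,
                        "SELECT * FROM orders WHERE id = $1",
                        1,      /* nParams */
                        NULL,   /* paramTypes: let the server infer */
                        params, /* paramValues */
                        NULL,   /* paramLengths: NULL ok for text values */
                        NULL,   /* paramFormats: all text */
                        0);     /* resultFormat: text */
}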
🛡️ SSRF Prevention in C
/* SSRF allowlist check before making outbound HTTP request */
#include <netdb.h>
#include <arpa/inet.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>
/* Returns 1 if IP is RFC-1918 / link-local / loopback (block these) */
static int is_private_ip(const char *ip) {
struct in_addr addr;
if (!inet_pton(AF_INET, ip, &addr)) return 0;
uint32_t n = ntohl(addr.s_addr);
return
(n >> 24 == 10) /* 10.0.0.0/8 */
|| (n >> 20 == (172 << 4) + 1) /* 172.16.0.0/12 */
|| (n >> 16 == (192 << 8) + 168) /* 192.168.0.0/16 */
|| (n >> 24 == 127) /* 127.0.0.0/8 */
|| (n >> 16 == (169 << 8) + 254); /* 169.254.0.0/16 */
}
int safe_fetch_url(const char *url) {
/* 1. Parse hostname from URL (simplified) */
char hostname[256] = {0};
if (sscanf(url, "https://%255[^/]", hostname) != 1) return -1;
/* 2. Allowlist check: only permitted domains */
const char *allowed[] = { "api.stripe.com", "hooks.slack.com", NULL };
int permitted = 0;
for (int i = 0; allowed[i]; i++)
if (strcmp(hostname, allowed[i]) == 0) { permitted = 1; break; }
if (!permitted) {
fprintf(stderr, "SSRF blocked: %s not in allowlist\n", hostname);
return -1;
}
/* 3. DNS resolution + IP check */
struct addrinfo hints = {0}, *res;
hints.ai_family = AF_INET; /* simplified: IPv4 only */
if (getaddrinfo(hostname, NULL, &hints, &res) != 0 || res == NULL) {
fprintf(stderr, "SSRF blocked: DNS resolution failed for %s\n", hostname);
return -1;
}
char ip[INET6_ADDRSTRLEN];
inet_ntop(AF_INET,
&((struct sockaddr_in *)res->ai_addr)->sin_addr,
ip, sizeof(ip));
freeaddrinfo(res);
if (is_private_ip(ip)) {
fprintf(stderr, "SSRF blocked: %s resolved to private IP %s\n",
hostname, ip);
return -1;
}
/* 4. Make the actual HTTP request */
return 0; /* proceed */
}
🔑 Secrets Management
| Approach | Security | Details |
|---|---|---|
| Hardcoded in source | ❌ Never | Committed to git, all developers see it, forever in history |
| Environment variables | ⚠️ Acceptable | Not in code but visible in process env, logs, crash dumps — use only with Kubernetes Secrets |
| Kubernetes Secrets | ✅ Good | Base64 in etcd (encrypt etcd at rest); mounted as files or env; access controlled by RBAC |
| HashiCorp Vault | ✅✅ Best | Dynamic secrets (generated on request, auto-expire), audit log, lease renewal, fine-grained access control |
| AWS Secrets Manager / GCP Secret Manager | ✅✅ Best | Managed service equivalent; auto-rotation; IAM-controlled access |
Dynamic secrets (Vault): instead of a long-lived DB password, Vault generates a unique username+password for each service instance with a 1-hour TTL. When the instance dies, the credential expires automatically. Breach impact is bounded in time and scope.
Secret rotation: rotate all secrets after any suspected breach. Never reuse credentials. Implement graceful rotation: support old and new secret simultaneously for 30s during rotation to avoid downtime.
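A sketch of the dual-secret acceptance window for inbound API keys (the parameter names are illustrative; a production check would also use a constant-time comparison rather than strcmp, which leaks timing):
#include <stdbool.h>
#include <string.h>
/* During rotation, accept both secrets so callers still holding the old
   credential keep working until the window closes. */
static bool api_key_valid(const char *presented,
                          const char *current_secret,
                          const char *previous_secret /* NULL outside window */) {
    if (strcmp(presented, current_secret) == 0) return true;
    return previous_secret != NULL && strcmp(presented, previous_secret) == 0;
}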
✅ Input Validation: Trust Boundaries
Validate all data at trust boundaries (external inputs). Trust internal calls and framework output.
/* Input validation at trust boundary (HTTP request body) */
#include <string.h>
#include <ctype.h>
typedef struct {
char order_id[37]; /* UUID: 36 chars + null */
double amount;
int item_count;
} order_request_t;
/* UUID format: 8-4-4-4-12 hex chars, dashes at positions 8, 13, 18, 23 */
static int is_valid_uuid(const char *s) {
for (int i = 0; i < 36; i++) {
int dash = (i == 8 || i == 13 || i == 18 || i == 23);
if (dash ? s[i] != '-' : !isxdigit((unsigned char)s[i])) return 0;
}
return 1;
}
int validate_order_request(const order_request_t *req) {
/* Size bounds */
if (strlen(req->order_id) != 36) return -1;
/* UUID format: 8-4-4-4-12 hex chars with dashes */
if (!is_valid_uuid(req->order_id)) return -1;
/* Business rule bounds */
if (req->amount <= 0.0 || req->amount > 100000.0) return -1;
if (req->item_count < 1 || req->item_count > 100) return -1;
return 0; /* valid */
}
/* Use allowlist validation, not denylist:
know what's valid and reject everything else,
rather than trying to enumerate all invalid inputs */
🚦 Rate Limiting Algorithms
Rate limiting protects services from overload and abusive clients. Implement at two layers:
- API Gateway: global rate limiting per client IP or API key — protects all services
- Per-service: self-defense against gateway bypass or internal traffic spikes
🪣 Token Bucket
A bucket holds up to capacity tokens. Tokens are added at rate r tokens/sec. Each request consumes one token. If the bucket is empty, the request is rejected.
Key property: allows bursts up to capacity while maintaining an average rate of r req/sec.
Refill rate: 10 tokens/sec Capacity: 20 tokens
t=0: [████████████████████] 20 tokens → burst of 20 requests: OK
t=0.01 [░░░░░░░░░░░░░░░░░░░░] ~0 tokens → request: REJECT (429)
t=0.5  [█████░░░░░░░░░░░░░░░] 5 tokens → 5 requests: OK
t=1.0  [█████░░░░░░░░░░░░░░░] 5 tokens → 5 requests: OK
Steady state: 10 req/sec sustained (burst allowed up to 20)
🪟 Sliding Window Counter
Divide time into fixed windows. Track request count in the current window and weight with the previous window's count. More accurate than a fixed window but O(1) memory.
Formula:
count = prev_window_count × overlap_fraction + curr_window_count
Example: with a 60s window, 15s into the current one, overlap = 1 − 15/60 = 0.75; with prev = 80 and curr = 30, count = 80 × 0.75 + 30 = 90.
/* Sliding window counter with Redis (pseudocode) */
long sliding_window_count(const char *client_key, int window_sec) {
long now = time(NULL);
long curr_window = now / window_sec;
long prev_window = curr_window - 1;
double elapsed = now % window_sec;
double overlap = 1.0 - elapsed / window_sec;
long prev_count = redis_get_counter(client_key, prev_window);
long curr_count = redis_incr_counter(client_key, curr_window, window_sec);
return (long)(prev_count * overlap + curr_count);
}
⚖️ Algorithm Comparison
| Algorithm | Burst Handling | Memory | Accuracy | Best For |
|---|---|---|---|---|
| Fixed Window Counter | Double burst at boundary (end+start of adjacent windows) | O(1) | Low (boundary problem) | Simple low-traffic systems |
| Sliding Window Log | Exact | O(requests in window) | Exact | Low-volume, exact limits needed |
| Sliding Window Counter | Approximate (±0.1%) | O(1) | High | Most production APIs |
| Token Bucket | Allows bursts up to capacity | O(1) | High | APIs tolerating short bursts |
| Leaky Bucket | Smooths all bursts, strict output rate | O(1) | High | Traffic shaping (network) |
📤 Rate Limit Response Headers
Return these headers on every response so clients can self-throttle:
HTTP/1.1 200 OK
X-RateLimit-Limit: 100 # max requests per window
X-RateLimit-Remaining: 47 # requests left in current window
X-RateLimit-Reset: 1711544400 # Unix timestamp when window resets
HTTP/1.1 429 Too Many Requests
Retry-After: 23 # seconds until client can retry
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
Content-Type: application/json
{"error": "RATE_LIMIT_EXCEEDED", "retry_after_seconds": 23}
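A sketch of emitting the 429 variant from a C handler — write_all() is a hypothetical socket-send helper, and the limit/reset bookkeeping is assumed to come from the rate limiter:
#include <stdio.h>
void write_all(int fd, const char *buf, int len); /* hypothetical send helper */
void respond_429(int fd, int limit, long reset_unix, int retry_after_s) {
    char body[128], buf[512];
    int blen = snprintf(body, sizeof(body),
        "{\"error\": \"RATE_LIMIT_EXCEEDED\", \"retry_after_seconds\": %d}",
        retry_after_s);
    int n = snprintf(buf, sizeof(buf),
        "HTTP/1.1 429 Too Many Requests\r\n"
        "Retry-After: %d\r\n"
        "X-RateLimit-Limit: %d\r\n"
        "X-RateLimit-Remaining: 0\r\n"
        "X-RateLimit-Reset: %ld\r\n"
        "Content-Type: application/json\r\n"
        "Content-Length: %d\r\n\r\n%s",
        retry_after_s, limit, reset_unix, blen, body);
    write_all(fd, buf, n);
}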
Never silently drop rate-limited requests. Return 429 with Retry-After so well-behaved clients back off correctly. Silently dropping causes clients to retry faster (thundering herd).
── Implementation 1 — Prometheus Metrics Exposition Endpoint ──
📡 Prometheus /metrics Endpoint in C
/* metrics.h — thread-safe counters and histograms for Prometheus exposition */
#pragma once
#include <stdatomic.h>
#include <stdio.h>
#include <string.h> /* strcmp */
#include <time.h>
/* HTTP request counter: label dimensions = {method, status} */
typedef struct {
_Atomic(long) get_2xx, get_4xx, get_5xx;
_Atomic(long) post_2xx, post_4xx, post_5xx;
} http_counters_t;
/* Latency histogram: fixed buckets in seconds */
#define BUCKET_COUNT 7
static const double BUCKETS[BUCKET_COUNT] =
{ 0.005, 0.01, 0.025, 0.05, 0.1, 0.5, 1.0 };
typedef struct {
_Atomic(long) bucket[BUCKET_COUNT];
_Atomic(long) count;
_Atomic(long) sum_us; /* sum in integer microseconds — C11 has no atomic fetch_add for double */
} latency_histogram_t;
/* Global metrics state */
static http_counters_t g_http = {0};
static latency_histogram_t g_lat = {0};
static _Atomic(long) g_active_conn = 0;
static inline void record_request(const char *method, int status, double dur_s) {
/* Increment counter by method+status */
if (strcmp(method, "GET") == 0) {
if (status < 300) atomic_fetch_add(&g_http.get_2xx, 1);
else if (status < 500) atomic_fetch_add(&g_http.get_4xx, 1);
else atomic_fetch_add(&g_http.get_5xx, 1);
} else if (strcmp(method, "POST") == 0) {
if (status < 300) atomic_fetch_add(&g_http.post_2xx, 1);
else if (status < 500) atomic_fetch_add(&g_http.post_4xx, 1);
else atomic_fetch_add(&g_http.post_5xx, 1);
}
/* Update histogram: buckets are cumulative — increment every bucket whose bound covers dur_s */
for (int i = 0; i < BUCKET_COUNT; i++)
if (dur_s <= BUCKETS[i])
atomic_fetch_add(&g_lat.bucket[i], 1);
atomic_fetch_add(&g_lat.count, 1);
/* Track the sum as integer microseconds to stay atomic without a mutex */
atomic_fetch_add(&g_lat.sum_us, (long)(dur_s * 1e6));
}
}
/* Render /metrics response body into buf */
static inline int render_metrics(char *buf, size_t sz) {
int n = 0;
n += snprintf(buf + n, sz - n,
"# HELP http_requests_total Total HTTP requests\n"
"# TYPE http_requests_total counter\n"
"http_requests_total{method=\"GET\",status=\"2xx\"} %ld\n"
"http_requests_total{method=\"GET\",status=\"4xx\"} %ld\n"
"http_requests_total{method=\"GET\",status=\"5xx\"} %ld\n",
atomic_load(&g_http.get_2xx),
atomic_load(&g_http.get_4xx),
atomic_load(&g_http.get_5xx));
n += snprintf(buf + n, sz - n,
"# HELP active_connections Current connections\n"
"# TYPE active_connections gauge\n"
"active_connections %ld\n",
atomic_load(&g_active_conn));
n += snprintf(buf + n, sz - n,
"# HELP request_duration_seconds Latency histogram\n"
"# TYPE request_duration_seconds histogram\n");
for (int i = 0; i < BUCKET_COUNT; i++)
n += snprintf(buf + n, sz - n,
"request_duration_seconds_bucket{le=\"%.3f\"} %ld\n",
BUCKETS[i], atomic_load(&g_lat.bucket[i]));
n += snprintf(buf + n, sz - n,
"request_duration_seconds_bucket{le=\"+Inf\"} %ld\n"
"request_duration_seconds_sum %.6f\n"
"request_duration_seconds_count %ld\n",
atomic_load(&g_lat.count),
atomic_load(&g_lat.sum_us) / 1e6,
atomic_load(&g_lat.count));
return n;
}
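Hypothetical wiring into a GET /metrics handler (send_all() is an assumed socket helper; the buffer size is arbitrary and a single-threaded scrape path is assumed for the static buffer). The version=0.0.4 content type is the Prometheus text exposition format:
/* GET /metrics handler sketch */
void send_all(int fd, const char *buf, int len); /* hypothetical send helper */
void handle_metrics(int client_fd) {
    static char body[16384];
    int blen = render_metrics(body, sizeof(body));
    char hdr[128];
    int hlen = snprintf(hdr, sizeof(hdr),
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/plain; version=0.0.4\r\n"
        "Content-Length: %d\r\n\r\n", blen);
    send_all(client_fd, hdr, hlen);
    send_all(client_fd, body, blen);
}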
── Implementation 2 — Structured JSON Logger ──
📝 Structured JSON Logger with Trace ID
/* logger.h — structured JSON logger with trace context */
#pragma once
#include <stdio.h>
#include <time.h>
#include <string.h>
typedef struct {
char trace_id[33]; /* 128-bit hex */
char span_id[17]; /* 64-bit hex */
} trace_ctx_t;
/* Thread-local trace context */
static _Thread_local trace_ctx_t tl_trace = {0};
static inline void log_set_trace(const char *trace_id, const char *span_id) {
strncpy(tl_trace.trace_id, trace_id, 32);
strncpy(tl_trace.span_id, span_id, 16);
tl_trace.trace_id[32] = tl_trace.span_id[16] = '\0';
}
static inline const char *get_iso8601(char *buf, size_t n) {
struct timespec ts;
clock_gettime(CLOCK_REALTIME, &ts);
struct tm tm;
gmtime_r(&ts.tv_sec, &tm); /* thread-safe variant of gmtime */
int len = strftime(buf, n, "%Y-%m-%dT%H:%M:%S", &tm);
snprintf(buf + len, n - len, ".%03ldZ", ts.tv_nsec / 1000000);
return buf;
}
/* JSON-escape a string (handles quotes and backslashes) */
static inline void json_escape(char *dst, size_t dsz,
const char *src) {
size_t d = 0;
for (size_t s = 0; src[s] && d + 2 < dsz; s++) {
if (src[s] == '"' || src[s] == '\\') dst[d++] = '\\';
dst[d++] = src[s];
}
dst[d] = '\0';
}
#define LOG(level, msg_fmt, ...) do { \
char _ts[32], _msg[512], _esc[512]; \
get_iso8601(_ts, sizeof(_ts)); \
snprintf(_msg, sizeof(_msg), msg_fmt, ##__VA_ARGS__); \
json_escape(_esc, sizeof(_esc), _msg); \
fprintf(stdout, \
"{\"ts\":\"%s\",\"level\":\"%s\",\"service\":\"order-svc\"," \
"\"trace_id\":\"%s\",\"span_id\":\"%s\",\"msg\":\"%s\"}\n", \
_ts, level, tl_trace.trace_id, tl_trace.span_id, _esc); \
} while(0)
#define LOG_INFO(fmt, ...) LOG("INFO", fmt, ##__VA_ARGS__)
#define LOG_WARN(fmt, ...) LOG("WARN", fmt, ##__VA_ARGS__)
#define LOG_ERROR(fmt, ...) LOG("ERROR", fmt, ##__VA_ARGS__)
/* Usage: */
/* log_set_trace("4bf92f3577b34da6a3ce929d", "00f067aa0ba902b7"); */
/* LOG_INFO("order placed order_id=%s amount=%.2f", order_id, amount); */
── Implementation 3 — Token Bucket Rate Limiter ──
🪣 Token Bucket Rate Limiter (Thread-Safe, C11 Atomics)
/* token_bucket.h — thread-safe token bucket rate limiter */
#pragma once
#include <stdatomic.h>
#include <time.h>
#include <stdbool.h>
typedef struct {
_Atomic(long) tokens_us; /* tokens × 1e6 — integer micro-tokens avoid float atomics */
_Atomic(long) last_refill_us; /* last refill time in microseconds (monotonic) */
long capacity_us; /* max tokens × 1e6 */
long rate_per_sec; /* refill rate in whole tokens per second */
} token_bucket_t;
static inline long now_us(void) {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}
static inline void tb_init(token_bucket_t *tb,
double rate_per_sec, double capacity) {
tb->capacity_us = (long)(capacity * 1e6);
tb->rate_per_sec = (long)rate_per_sec;
atomic_store(&tb->tokens_us, tb->capacity_us);
atomic_store(&tb->last_refill_us, now_us());
}
/* Returns true if request is allowed; false if rate-limited */
static inline bool tb_allow(token_bucket_t *tb) {
long now = now_us();
long last = atomic_load(&tb->last_refill_us);
long elapsed_us = now - last;
/* Refill: tokens/sec × elapsed µs = micro-tokens exactly (no fractional loss).
   CAS on last_refill_us so only one thread claims each elapsed interval. */
long add_us = tb->rate_per_sec * elapsed_us;
if (add_us > 0 &&
    atomic_compare_exchange_strong(&tb->last_refill_us, &last, now)) {
long cur = atomic_fetch_add(&tb->tokens_us, add_us) + add_us;
if (cur > tb->capacity_us) /* cap at capacity (benign race: only lowers) */
atomic_store(&tb->tokens_us, tb->capacity_us);
}
/* Try to consume one token (1e6 micro-tokens) */
long prev = atomic_fetch_sub(&tb->tokens_us, 1000000LL);
if (prev >= 1000000LL) return true; /* allowed */
atomic_fetch_add(&tb->tokens_us, 1000000LL); /* not enough: restore */
return false; /* rate limited */
}
/* Usage: */
/* token_bucket_t per_client_bucket; */
/* tb_init(&per_client_bucket, 100.0, 200.0); // 100 req/sec, burst=200 */
/* if (!tb_allow(&per_client_bucket)) { respond_429(); return; } */
🔬 Lab 1 — Prometheus Metrics + Grafana Dashboard
Instrument a C service and build a RED dashboard in Grafana.
1 Add the metrics module from Implementation 1 to the health check HTTP server (M15). Expose /metrics on port 8081 alongside /health/live.
2 Run Prometheus locally: docker run -p 9090:9090 -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus. Configure prometheus.yml to scrape localhost:8081/metrics every 15s.
3 Generate load with wrk -t4 -c100 -d30s http://localhost:8080/orders. Watch metrics accumulate at http://localhost:9090.
4 Run Grafana: docker run -p 3000:3000 grafana/grafana. Add Prometheus as a data source. Build a RED dashboard with three panels: request rate, error rate %, p99 latency.
5 Add a Prometheus alerting rule: fire if error rate > 1% for 2 minutes. Simulate errors by making your handler return 500 randomly. Watch the alert move from PENDING to FIRING.
🔬 Lab 2 — Distributed Trace Propagation
Implement W3C traceparent propagation across two C services.
1 Build Service A (port 8080) and Service B (port 8081). Both run the structured logger from Implementation 2.
2 Service A: on each request, generate a trace_id (UUID) + span_id. If the request already has traceparent, extract the trace_id and create a child span_id.
3 Service A calls Service B via HTTP, forwarding the traceparent header. Service B logs with the same trace_id from the header.
4 Send a request to Service A. In the combined log output, grep for the trace_id — verify both services' logs appear with the same trace_id, showing the full request chain.
5 Bonus: run Jaeger locally (docker run -p 16686:16686 jaegertracing/all-in-one). Use the OTel C++ SDK to export spans to Jaeger and view the trace waterfall.
🔬 Lab 3 — Token Bucket Rate Limiter Under Load
Verify rate limiter correctness under concurrent load.
1 Integrate the token bucket from Implementation 3 into the HTTP server. Return 429 Too Many Requests with a Retry-After header when rate-limited.
2 Configure: 100 req/sec rate, 150 token burst capacity.
3 Burst test: send 200 requests simultaneously. Verify ~150 succeed and ~50 receive 429. Check the Prometheus counter: http_requests_total{status="4xx"}.
4 Sustained test: wrk -t8 -c100 -d60s at 1000 req/sec (10× the limit). Verify roughly 100 req/sec succeed and the rest receive 429. The success rate should be stable over the 60s window.
5 Concurrency test: 8 threads all decrementing the same bucket simultaneously for 10 seconds. Verify no race conditions with TSan: clang -fsanitize=thread.
🔬 Lab 4 — Security: SQL Injection & SSRF Prevention
Demonstrate vulnerability and fix in C using libpq.
1 Write a vulnerable handler: char query[256]; snprintf(query, sizeof(query), "SELECT * FROM orders WHERE id='%s'", user_input); PQexec(conn, query);. Try input: '; DROP TABLE orders; --. Verify it executes.
2 Fix it: use PQexecParams(conn, "SELECT * FROM orders WHERE id=$1", 1, NULL, params, NULL, NULL, 0). Retry the injection — verify it returns no results (the entire input is treated as a literal string).
3 Write an SSRF-vulnerable handler: accept a URL from the request body and fetch it with libcurl. Try the AWS metadata endpoint: http://169.254.169.254/latest/meta-data/. Verify it returns data.
4 Fix it: integrate safe_fetch_url() from the SSRF prevention section above. Verify the metadata URL is blocked and a legitimate allowlisted URL succeeds.
── Phase 7 Mastery Checklist ──
Observability
- Explain the 3 pillars and what question each answers
- Write a structured JSON log line with all mandatory fields
- List what must never appear in logs (secrets, PII)
- Explain how trace_id links logs across services
- Distinguish counter, gauge, histogram, summary
- Write PromQL for request rate, error rate %, and p99 latency
- Apply RED method to a service and USE method to a resource
- Explain why high-cardinality labels are dangerous in Prometheus
- Write a Prometheus alerting rule with a for duration
- Explain trace, span, parent_span_id relationship
- Parse and construct a W3C traceparent header
- Choose the right sampling strategy for given traffic/budget
Alerting & SLO
- Define SLI, SLO, SLA, Error Budget
- Write a burn rate alert (faster than threshold-crossing alerts)
- Explain why alerting on symptoms is better than causes
Security & Rate Limiting
- Fix SQL injection with parameterized queries in libpq
- Block SSRF with allowlist + RFC-1918 IP check
- Explain broken access control with a concrete example
- Describe dynamic secrets (Vault) vs static env vars
- Write allowlist input validation for a struct field
- Implement token bucket: capacity, rate, burst
- Compare sliding window counter vs fixed window (boundary problem)
- Return correct 429 + Retry-After + X-RateLimit-* headers