M17 — Observability & Hardening

Phase 7 · Topics: the 3 pillars (logs · metrics · traces) · Prometheus & PromQL · Distributed tracing & OpenTelemetry · Rate limiting · OWASP Top 10 · Secrets management · Graceful shutdown
🔭 The 3 Pillars of Observability
Observability is the ability to understand the internal state of a system from its external outputs. The three pillars each answer a different question:
Pillar | Question Answered | Data Type | Tools
Logs | "What happened, exactly?" | Discrete events with context | ELK, Loki, Fluentd
Metrics | "How fast / how many / how full?" | Aggregated numeric time-series | Prometheus, Grafana, Datadog
Traces | "Why was this request slow?" | Causal chains across services | Jaeger, Tempo, Zipkin, Honeycomb
The three pillars are complementary, not interchangeable. An alert fires on a metric (high p99 latency). You look at a trace to find the slow span. You look at logs from that span to see the exact error. Use all three together.
Analogy — The flight data recorder:
Logs are the cockpit voice recorder — full narrative of what was said. Metrics are the flight data recorder — altitude, speed, attitude plotted over time. Traces are the air traffic control replay — the full path of the aircraft from departure to destination. An accident investigation uses all three.
📐 Observability vs Monitoring
Aspect | Monitoring | Observability
Approach | Predefined thresholds and dashboards for known failure modes | Ability to ask arbitrary questions about system behavior
Limits | Only catches failures you anticipated and built alerts for | Enables debugging novel, unknown failure modes
Data | Aggregated metrics, simple health checks | Logs + metrics + traces with high cardinality
Tooling | Nagios, simple dashboards | OpenTelemetry, Honeycomb, Grafana + Loki + Tempo
Start with monitoring (dashboards for known metrics, alerts on thresholds). Add observability as system complexity grows — when you start debugging failures you didn't anticipate.
📐 Phase 7 Module Map
Module | Topic | Key Concepts
M17 (this) | Observability & Hardening | Logs, metrics, traces, alerting, SLO, security, rate limiting
M18 | Performance Engineering | Profiling, flame graphs, memory analysis, benchmark methodology
Prerequisites: Ph6 (Microservices — you need services to observe; health probes from M15 are the basis of readiness checks here)
📝 Structured Logging: JSON Lines Format
Write one JSON object per line to stdout. Every log line must include mandatory fields for searchability:
/* Good: structured JSON log */
{"ts":"2026-03-27T14:23:01.442Z","level":"INFO","service":"order-svc",
 "trace_id":"4bf92f3577b34da6","span_id":"00f067aa0ba902b7",
 "msg":"order placed","order_id":"ord-9821","user_id":"u-44","amount_usd":49.99}

/* Bad: unstructured text — unsearchable */
[2026-03-27 14:23:01] INFO: Order ord-9821 placed by user u-44 for $49.99
📋 Mandatory Log Fields
Field | Format | Purpose
ts | ISO-8601 with ms | Timeline reconstruction
level | DEBUG/INFO/WARN/ERROR/FATAL | Log level filtering
service | service name | Multi-service log aggregation
trace_id | hex string | Correlate with traces
span_id | hex string | Correlate with specific span
msg | human-readable | Event description
🎚️ Log Levels — When to Use Each
Level | Use When
DEBUG | Verbose detail for local dev only. Never in production — log volume explosion.
INFO | Normal business events: request received, order placed, job started.
WARN | Degraded but recoverable: retry succeeded, cache miss, approaching limit.
ERROR | Unexpected failure requiring attention: DB timeout, invalid state, downstream error.
FATAL | Unrecoverable — process will exit after logging.
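The level gate that enforces this policy at runtime is small; a minimal sketch (names like log_enabled and g_min_level are illustrative, not part of the module's logger API):

```c
#include <stdbool.h>

/* Minimal runtime level gate: anything below the configured minimum is
   dropped before formatting (keeps DEBUG out of production output). */
typedef enum { LVL_DEBUG, LVL_INFO, LVL_WARN, LVL_ERROR, LVL_FATAL } log_level_t;

static log_level_t g_min_level = LVL_INFO;  /* prod default: DEBUG suppressed */

static bool log_enabled(log_level_t lvl) {
    return lvl >= g_min_level;
}
```

Checking the gate before formatting matters: the snprintf and JSON escaping cost of a suppressed line should never be paid.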
🚫 What NOT to Log
Never Log | Why | Alternative
Passwords, API keys, tokens | Log aggregators, retention, and breach exposure | Log presence/absence, not value
Full credit card numbers | PCI-DSS violation | Log last 4 digits only
PII (emails, SSN, full name) | GDPR/CCPA violation | Log user_id (opaque reference)
Full request/response bodies | Volume, PII risk | Log status codes and latency only
Health probe hits (/health/*) | Thousands/min of noise | Filter at log aggregator
Log injection: never embed user-supplied strings directly in log messages without sanitization. A user who sets their name to ","level":"ERROR","msg":"admin escalation can forge log entries. Escape or use parameterized logging.
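Sanitizing before interpolation can be sketched like this (log_sanitize is a hypothetical helper; the idea is to drop newlines and escape JSON specials so user input cannot start a new log line or close the msg field):

```c
#include <stddef.h>

/* Drop CR/LF and escape JSON specials so a user-supplied string cannot
   forge a new log line or inject extra JSON fields. Hypothetical helper. */
static void log_sanitize(char *dst, size_t dsz, const char *src) {
    size_t d = 0;
    for (size_t s = 0; src[s] != '\0' && d + 2 < dsz; s++) {
        char c = src[s];
        if (c == '\r' || c == '\n') continue;        /* strip newlines  */
        if (c == '"' || c == '\\') dst[d++] = '\\';  /* escape specials */
        dst[d++] = c;
    }
    dst[d] = '\0';
}
```

Run every user-controlled string through this (or the logger's own escaper) before it reaches a log call.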
🔗 Correlation: Linking Logs Across Services
The trace_id links all log lines for a single request across every service it touches:
/* API Gateway generates trace_id */
GET /orders/123 → trace_id: 4bf92f3577b34da6

/* Order Service log */
{"service":"order-svc","trace_id":"4bf92f3577b34da6","msg":"fetching order"}

/* Database call log */
{"service":"order-svc","trace_id":"4bf92f3577b34da6","msg":"db query","duration_ms":12}

/* In Loki/Elasticsearch: search by trace_id to see full request timeline */
{trace_id="4bf92f3577b34da6"} | json | sort by ts
In Grafana with Loki + Tempo integration: click a trace in Tempo → jump directly to correlated logs in Loki for that trace_id. This cross-pillar navigation is the power of consistent trace_id propagation.
📊 Metric Types
Type | Properties | Example | PromQL Usage
Counter | Monotonically increasing; never decreases; resets to 0 on restart | http_requests_total{method="GET",status="200"} | rate(http_requests_total[5m]) → requests/sec
Gauge | Point-in-time value; can go up or down | active_connections, memory_usage_bytes, queue_depth | Direct: active_connections > 1000
Histogram | Samples bucketed by value; provides _count, _sum, _bucket | request_duration_seconds{le="0.1"} | histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))
Summary | Pre-computed quantiles on client side (less flexible) | request_duration_seconds{quantile="0.99"} | Direct quantile access; can't re-aggregate across instances
Prefer Histogram over Summary. Histograms can be aggregated across multiple instances (e.g., p99 across all pods). Summaries compute quantiles per-process and can't be aggregated.
🔴 RED Method (Services)
The minimal set of metrics for any service:
  • Rate: requests per second — is traffic normal?
    rate(http_requests_total[5m])
  • Errors: error rate — are users experiencing failures?
    rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
  • Duration: latency percentiles — is it slow?
    histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))
Alert on all three. High error rate → immediate page. High p99 → investigate. Low rate → traffic drop (upstream issue or deploy broke routing).
📊 USE Method (Resources)
For infrastructure resources (CPU, memory, disk, network):
  • Utilization: % time resource is busy
    rate(process_cpu_seconds_total[1m]) * 100
  • Saturation: extra work queued (can't keep up)
    node_load1 / count(node_cpu_seconds_total{mode="idle"}) by (instance)
  • Errors: error count or rate of resource
    node_disk_io_time_weighted_seconds_total
High utilization alone isn't a problem. High utilization + high saturation = at capacity. Alert when saturation exceeds 0 (work is queued).
📡 Prometheus: Pull-Based Scraping & Exposition Format
Prometheus polls your service's /metrics endpoint on a configurable interval (typically 15s). Your service exposes metrics in the Prometheus text format:
# HELP http_requests_total Total HTTP requests by method and status code
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 14823
http_requests_total{method="POST",status="201"} 3291
http_requests_total{method="GET",status="500"} 42

# HELP request_duration_seconds Request latency histogram
# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{le="0.01"} 8901
request_duration_seconds_bucket{le="0.05"} 13102
request_duration_seconds_bucket{le="0.1"} 14651
request_duration_seconds_bucket{le="0.5"} 14820
request_duration_seconds_bucket{le="+Inf"} 14823
request_duration_seconds_sum 891.23
request_duration_seconds_count 14823

# HELP active_connections Current active connections
# TYPE active_connections gauge
active_connections 47
Labels are high-cardinality risk. A label like {user_id="..."} creates one time-series per user — millions of time-series destroy Prometheus. Use low-cardinality labels: method, status, endpoint (grouped). Never use user IDs, trace IDs, or UUIDs as labels.
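One common way to keep the endpoint label bounded is to normalize raw paths before using them as a label value; a sketch (normalize_endpoint is a hypothetical helper that collapses numeric path segments):

```c
#include <ctype.h>
#include <string.h>

/* Rewrite numeric path segments as ":id" so the endpoint label stays
   low-cardinality: "/orders/123" becomes "/orders/:id". Sketch only. */
static void normalize_endpoint(char *dst, size_t dsz, const char *path) {
    size_t d = 0;
    const char *p = path;
    while (*p != '\0' && d + 4 < dsz) {
        if (*p != '/') { dst[d++] = *p++; continue; }
        dst[d++] = *p++;                            /* copy the '/' */
        const char *seg = p;
        int digits = (*p != '\0' && *p != '/');     /* non-empty segment? */
        while (*p != '\0' && *p != '/') {
            if (!isdigit((unsigned char)*p)) digits = 0;
            p++;
        }
        if (digits) { memcpy(dst + d, ":id", 3); d += 3; }
        else {
            size_t len = (size_t)(p - seg);
            if (d + len >= dsz) break;              /* avoid overflow */
            memcpy(dst + d, seg, len); d += len;
        }
    }
    dst[d] = '\0';
}
```

The metric then carries endpoint="/orders/:id" regardless of how many distinct order IDs exist.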
🔍 Essential PromQL Queries
What You Want | PromQL Query
Request rate (req/sec) | rate(http_requests_total[5m])
Error rate % | rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100
p99 latency | histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket[5m])) by (le))
CPU usage % | 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Memory used | node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
Kafka consumer lag | kafka_consumer_group_lag{topic="orders",partition="0"}
DB connection pool saturation | pg_stat_activity_count / pg_settings_max_connections
🔗 Distributed Tracing: Trace & Span Model
A trace represents the complete journey of a single request through the system — from the client's HTTP request through every service, database call, and queue publish it touches.

A span represents a single unit of work within a trace (one service call, one DB query). Each span has:
  • trace_id — shared across all spans in the same request
  • span_id — unique to this operation
  • parent_span_id — the span that triggered this one (null for root span)
  • Start time + duration
  • Attributes (key-value context)
  • Status (OK / ERROR)
Trace: trace_id=4bf92f3577b34da6

span_id=00f067aa  [API Gateway] GET /orders/123       0ms ──────────────── 87ms
│
├─ span_id=a3ce929d  [Order Service] handle_request   2ms ──────── 83ms
│  │
│  ├─ span_id=5e0c22e9  [DB: SELECT orders]           4ms ── 16ms   (12ms)
│  │
│  ├─ span_id=7f3d8a1b  [Redis: GET cache]           18ms ─ 19ms     (1ms)
│  │
│  └─ span_id=2d4f991c  [Kafka: publish event]       20ms ──── 34ms (14ms)
│
└─ span_id=b1c5e072  [Auth Service] verify_token      1ms ─ 2ms      (1ms)

Flame chart: wider = longer. The DB SELECT (12ms) and the Kafka publish (14ms) are the hot spots.
📡 W3C traceparent Header
The standard header for propagating trace context across HTTP service calls:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                 │
             │  └─ trace_id (128-bit, 32 hex)    └─ parent span id └─ flags
             └─ version (00)                        (64-bit, 16 hex)  01 = sampled
                                                                      00 = not sampled
Each service: read traceparent from the incoming request, start a child span whose parent_span_id is the incoming parent-id field, generate a fresh span_id for the new span, and send the updated traceparent on outgoing calls.
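That propagation step starts with parsing the header; a minimal parser for the version-00 format might look like this (parse_traceparent is illustrative, not a standard API):

```c
#include <ctype.h>
#include <string.h>

/* Parse a version-00 traceparent: "00-<32 hex>-<16 hex>-<2 hex>".
   Returns 0 on success, -1 on malformed input. */
static int parse_traceparent(const char *hdr, char trace_id[33],
                             char parent_id[17], int *sampled) {
    if (strlen(hdr) != 55) return -1;           /* 2+1+32+1+16+1+2 */
    for (int i = 0; i < 55; i++) {
        int dash = (i == 2 || i == 35 || i == 52);
        if (dash ? hdr[i] != '-' : !isxdigit((unsigned char)hdr[i])) return -1;
    }
    memcpy(trace_id, hdr + 3, 32);   trace_id[32] = '\0';
    memcpy(parent_id, hdr + 36, 16); parent_id[16] = '\0';
    int flags = (hdr[54] <= '9') ? hdr[54] - '0'
                                 : tolower((unsigned char)hdr[54]) - 'a' + 10;
    *sampled = flags & 1;                       /* low bit = sampled */
    return 0;
}
```

The outgoing header reuses trace_id unchanged and substitutes the new span's id in the parent-id slot.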
📦 OpenTelemetry (OTel)
OpenTelemetry is the CNCF standard for instrumentation — vendor-neutral SDK for generating traces, metrics, and logs.
  • SDKs: official implementations for C++, Go, Java, Python, and more (there is no first-party plain-C SDK; C services typically wrap the C++ SDK)
  • OTLP: OpenTelemetry Protocol — exports to any backend
  • OTel Collector: receives OTLP, processes (batch, sample), exports to Jaeger/Tempo/Datadog
  • Auto-instrumentation: inject tracing without changing application code (Java agent, eBPF)
  • Manual instrumentation: create custom spans for business logic
📊 Sampling Strategies
At high volume (10,000 req/sec), storing every trace is expensive. Sampling decides which traces to keep:
Strategy | How | Trade-off
Head-based: Always-on | Keep 100% of traces | Very expensive at scale
Head-based: Probability | Keep N% (e.g. 1%) | Simple, but misses rare errors
Head-based: Rate-limit | Keep up to N traces/sec | Bounded cost; may drop bursts
Tail-based: Error sampling | Buffer all traces; keep if trace has an error span | Catches errors; high memory buffer
Tail-based: Latency threshold | Keep if trace duration > P99 threshold | Catches slowness; complex to implement
Production recommendation: 1% head-based sampling for normal traffic + 100% sampling for traces with errors (tail-based error sampling). This keeps costs bounded while capturing all failure evidence.
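The head-based probability decision is just a cheap random draw made once at the root span; a self-contained sketch (a tiny deterministic LCG stands in for the SDK's random source, and SAMPLE_PCT / head_sample are illustrative names):

```c
#include <stdbool.h>

/* Head-based probability sampling: decide once, at the root span, whether
   to keep the whole trace. Downstream services honor the decision via the
   traceparent flags byte instead of re-rolling. */
#define SAMPLE_PCT 1   /* keep ~1% of traces */

static unsigned int lcg_next(unsigned int *seed) {
    *seed = *seed * 1103515245u + 12345u;   /* classic LCG, demo only */
    return (*seed >> 16) & 0x7fff;
}

static bool head_sample(unsigned int *seed) {
    return (lcg_next(seed) % 100) < SAMPLE_PCT;
}
```

Because the decision rides in the traceparent flags, either every span of a trace is kept or none is; partial traces are useless.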
🔔 Alerting with Prometheus AlertManager
Prometheus evaluates alerting rules against scraped metrics. When conditions are met, it sends alerts to AlertManager, which routes them to PagerDuty, Slack, email, etc.
# alerting_rules.yml
groups:
  - name: order-service
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{service="order-svc",status=~"5.."}[5m])
            / rate(http_requests_total{service="order-svc"}[5m]) > 0.01
        for: 2m   # must be true for 2m before firing (avoid flapping)
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 1% for order-service"
          runbook_url: "https://wiki/runbooks/order-svc-errors"

      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate(request_duration_seconds_bucket{service="order-svc"}[5m])) by (le)) > 2.0
        for: 5m
        labels:
          severity: warning

      - alert: ServiceDown
        expr: up{service="order-svc"} == 0
        for: 1m
        labels:
          severity: critical
📏 SLI, SLO, SLA
  • SLI (Service Level Indicator): a specific measurable metric — e.g., availability = (successful requests) / (total requests)
  • SLO (Service Level Objective): internal target — e.g., availability ≥ 99.9% over 30 days. Engineering commits to this.
  • SLA (Service Level Agreement): external contract with customers — stricter legal/financial penalties. SLO should be tighter than SLA as a safety buffer.
  • Error Budget: SLO headroom — 99.9% SLO = 0.1% budget = 43.2 min per 30-day window. Track burn rate.
💸 Error Budget & Burn Rate Alerts
Don't alert "SLO violated" (too late). Alert on burn rate — how fast you're consuming the error budget:
  • Fast burn: consuming at 14.4× the sustainable rate → a 30-day budget exhausts in ~2 days (2% of the budget gone in one hour) → page immediately
  • Slow burn: consuming at 3× → budget exhausts in 10 days → ticket
# Fast burn alert: >14x budget consumption over 1h
expr: |
  (1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
          / sum(rate(http_requests_total[1h]))))
    / (1 - 0.999) > 14
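The arithmetic behind these thresholds, as a quick check (assuming a 30-day window and a constant burn rate; function names are illustrative):

```c
/* Error-budget arithmetic for an availability SLO over a rolling window. */
static double budget_minutes(double slo, double window_days) {
    return window_days * 24 * 60 * (1.0 - slo);   /* allowed bad minutes */
}

static double hours_to_exhaust(double burn_rate, double window_days) {
    return window_days * 24 / burn_rate;          /* time until budget gone */
}
```

budget_minutes(0.999, 30) gives 43.2; hours_to_exhaust(14.4, 30) gives 50, i.e. roughly two days at a sustained 14.4× burn.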
📋 Alert Design Principles
  • Alert on symptoms, not causes: Alert on high error rate (symptom users feel), not on "CPU at 80%" (cause — may not impact users)
  • Every alert needs a runbook: include runbook_url annotation. On-call engineers should never face an alert without documented response steps.
  • Avoid alert fatigue: if the same alert fires weekly and engineers silence it, it's not actionable. Remove or fix it.
  • Use for duration: require condition to be sustained before paging (avoids flapping on 1-second spikes)
  • Group related alerts: AlertManager can group 50 firing alerts into one notification — prevents notification flood during outages
🔒 OWASP Top 10 for Backend Services
Vulnerability | C/Backend Example | Prevention
SQL Injection | "SELECT * FROM users WHERE id=" + user_id | Parameterized queries only: PQexecParams(conn, "SELECT... WHERE id=$1", 1, NULL, params, ...)
Command Injection | system("ls " + user_input) | Never use system() with user input. Use execv() with argument array.
SSRF | Service fetches URL from user request body; attacker uses http://169.254.169.254/ (AWS metadata) | Allowlist of permitted outbound domains; block RFC-1918 and link-local addresses
Broken Access Control | User A can read User B's orders by changing order_id in request | Check authorization on every resource: WHERE id=$1 AND user_id=$2
Security Misconfiguration | Debug endpoints enabled in prod, default credentials, verbose error messages | Separate prod config; disable /debug endpoints; return generic errors
Insecure Deserialization | Deserializing untrusted binary input (msgpack, protobuf from user) | Validate schema; set max sizes; reject unknown fields
Cryptographic Failures | MD5 for passwords, ECB mode, hardcoded keys | bcrypt/Argon2 for passwords; AES-256-GCM for encryption; libsodium
🛡️ SSRF Prevention in C
/* SSRF allowlist check before making outbound HTTP request */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <netdb.h>
#include <arpa/inet.h>

/* Returns 1 if IP is RFC-1918 / link-local / loopback (block these) */
static int is_private_ip(const char *ip) {
    struct in_addr addr;
    if (!inet_pton(AF_INET, ip, &addr)) return 0;
    uint32_t n = ntohl(addr.s_addr);
    return (n >> 24 == 10)                 /* 10.0.0.0/8     */
        || (n >> 20 == (172 << 4) + 1)     /* 172.16.0.0/12  */
        || (n >> 16 == (192 << 8) + 168)   /* 192.168.0.0/16 */
        || (n >> 24 == 127)                /* 127.0.0.0/8    */
        || (n >> 16 == (169 << 8) + 254);  /* 169.254.0.0/16 */
}

int safe_fetch_url(const char *url) {
    /* 1. Parse hostname from URL (simplified) */
    char hostname[256];
    if (sscanf(url, "https://%255[^/]", hostname) != 1) return -1;

    /* 2. Allowlist check: only permitted domains */
    const char *allowed[] = { "api.stripe.com", "hooks.slack.com", NULL };
    int permitted = 0;
    for (int i = 0; allowed[i]; i++)
        if (strcmp(hostname, allowed[i]) == 0) { permitted = 1; break; }
    if (!permitted) {
        fprintf(stderr, "SSRF blocked: %s not in allowlist\n", hostname);
        return -1;
    }

    /* 3. DNS resolution + IP check (every resolved address; IPv4 only here) */
    struct addrinfo hints = {0}, *res;
    hints.ai_family = AF_INET;
    if (getaddrinfo(hostname, NULL, &hints, &res) != 0) return -1;
    for (struct addrinfo *ai = res; ai; ai = ai->ai_next) {
        char ip[INET_ADDRSTRLEN];
        inet_ntop(AF_INET, &((struct sockaddr_in *)ai->ai_addr)->sin_addr,
                  ip, sizeof(ip));
        if (is_private_ip(ip)) {
            fprintf(stderr, "SSRF blocked: %s resolved to private IP %s\n",
                    hostname, ip);
            freeaddrinfo(res);
            return -1;
        }
    }
    freeaddrinfo(res);

    /* 4. Make the actual HTTP request */
    return 0; /* proceed */
}
🔑 Secrets Management
Approach | Security | Details
Hardcoded in source | ❌ Never | Committed to git, all developers see it, forever in history
Environment variables | ⚠️ Acceptable | Not in code but visible in process env, logs, crash dumps — use only with Kubernetes Secrets
Kubernetes Secrets | ✅ Good | Base64 in etcd (encrypt etcd at rest); mounted as files or env; access controlled by RBAC
HashiCorp Vault | ✅✅ Best | Dynamic secrets (generated on request, auto-expire), audit log, lease renewal, fine-grained access control
AWS Secrets Manager / GCP Secret Manager | ✅✅ Best | Managed service equivalent; auto-rotation; IAM-controlled access
Dynamic secrets (Vault): instead of a long-lived DB password, Vault generates a unique username+password for each service instance with a 1-hour TTL. When the instance dies, the credential expires automatically. Breach impact is bounded in time and scope.
Secret rotation: rotate all secrets after any suspected breach. Never reuse credentials. Implement graceful rotation: accept both the old and new secret during an overlap window so rotation causes no downtime.
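A dual-secret acceptance check for that overlap window can be sketched as follows (secret_valid and ct_equal are illustrative names; the comparison is constant-time to avoid timing leaks):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Constant-time comparison: no early exit that leaks where strings differ. */
static bool ct_equal(const char *a, const char *b, size_t n) {
    unsigned char diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= (unsigned char)(a[i] ^ b[i]);
    return diff == 0;
}

/* Accept the current secret, or the previous one during the overlap window. */
static bool secret_valid(const char *presented, const char *current,
                         const char *previous) {
    size_t n = strlen(presented);
    if (strlen(current) == n && ct_equal(presented, current, n)) return true;
    if (previous != NULL && strlen(previous) == n &&
        ct_equal(presented, previous, n)) return true;
    return false;
}
```

When the window closes, set previous to NULL and only the rotated secret is accepted.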
✅ Input Validation: Trust Boundaries
Validate all data at trust boundaries (external inputs). Trust internal calls and framework output.
/* Input validation at trust boundary (HTTP request body) */
#include <string.h>
#include <ctype.h>

typedef struct {
    char   order_id[37];  /* UUID: 36 chars + null */
    double amount;
    int    item_count;
} order_request_t;

/* UUID format: 8-4-4-4-12 hex chars with dashes */
static int is_valid_uuid(const char *s) {
    if (strlen(s) != 36) return 0;
    for (int i = 0; i < 36; i++) {
        if (i == 8 || i == 13 || i == 18 || i == 23) {
            if (s[i] != '-') return 0;
        } else if (!isxdigit((unsigned char)s[i])) return 0;
    }
    return 1;
}

int validate_order_request(const order_request_t *req) {
    /* Size bounds */
    if (strlen(req->order_id) != 36) return -1;
    if (!is_valid_uuid(req->order_id)) return -1;
    /* Business rule bounds */
    if (req->amount <= 0.0 || req->amount > 100000.0) return -1;
    if (req->item_count < 1 || req->item_count > 100) return -1;
    return 0; /* valid */
}

/* Use allowlist validation, not denylist: know what's valid and reject
   everything else, rather than trying to enumerate all invalid inputs */
🚦 Rate Limiting Algorithms
Rate limiting protects services from overload and abusive clients. Implement at two layers:
  • API Gateway: global rate limiting per client IP or API key — protects all services
  • Per-service: self-defense against gateway bypass or internal traffic spikes
🪣 Token Bucket
A bucket holds up to capacity tokens. Tokens are added at rate r tokens/sec. Each request consumes one token. If the bucket is empty, the request is rejected.

Key property: allows bursts up to capacity while maintaining an average rate of r req/sec.
Refill rate: 10 tokens/sec   Capacity: 20 tokens

t=0:   [████████████████████] 20 tokens → burst of 20 requests: OK
t=0.1: [░░░░░░░░░░░░░░░░░░░░]  0 tokens → request: REJECT (429)
t=0.5: [█████░░░░░░░░░░░░░░░]  5 tokens → 5 requests: OK
t=1.0: [██████████░░░░░░░░░░] 10 tokens → 10 requests: OK

Steady state: 10 req/sec sustained (burst allowed up to 20)
🪟 Sliding Window Counter
Divide time into fixed windows. Track the request count in the current window and weight it with the previous window's count. More accurate than a fixed window while still O(1) memory.

Formula: count = prev_window_count × overlap_fraction + curr_window_count
/* Sliding window counter with Redis (pseudocode) */
long sliding_window_count(const char *client_key, int window_sec) {
    long now = time(NULL);
    long curr_window = now / window_sec;
    long prev_window = curr_window - 1;
    double elapsed = now % window_sec;
    double overlap = 1.0 - elapsed / window_sec;

    long prev_count = redis_get_counter(client_key, prev_window);
    long curr_count = redis_incr_counter(client_key, curr_window, window_sec);

    return (long)(prev_count * overlap + curr_count);
}
⚖️ Algorithm Comparison
Algorithm | Burst Handling | Memory | Accuracy | Best For
Fixed Window Counter | Double burst at boundary (end+start of adjacent windows) | O(1) | Low (boundary problem) | Simple low-traffic systems
Sliding Window Log | Exact | O(requests in window) | Exact | Low-volume, exact limits needed
Sliding Window Counter | Approximate (±0.1%) | O(1) | High | Most production APIs
Token Bucket | Allows bursts up to capacity | O(1) | High | APIs tolerating short bursts
Leaky Bucket | Smooths all bursts, strict output rate | O(1) | High | Traffic shaping (network)
📤 Rate Limit Response Headers
Return these headers on every response so clients can self-throttle:
HTTP/1.1 200 OK
X-RateLimit-Limit: 100          # max requests per window
X-RateLimit-Remaining: 47       # requests left in current window
X-RateLimit-Reset: 1711544400   # Unix timestamp when window resets

HTTP/1.1 429 Too Many Requests
Retry-After: 23                 # seconds until client can retry
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
Content-Type: application/json

{"error": "RATE_LIMIT_EXCEEDED", "retry_after_seconds": 23}
Never silently drop rate-limited requests. Return 429 with Retry-After so well-behaved clients back off correctly. Silently dropping causes clients to retry faster (thundering herd).
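Rendering that 429 response is mechanical; a sketch (render_429 is a hypothetical helper, and the limit/reset/retry values come from the limiter's state):

```c
#include <stdio.h>

/* Build the complete 429 response into buf; returns bytes that would be
   written, as snprintf does. */
static int render_429(char *buf, size_t sz, int limit,
                      long reset_unix, int retry_after) {
    return snprintf(buf, sz,
        "HTTP/1.1 429 Too Many Requests\r\n"
        "Retry-After: %d\r\n"
        "X-RateLimit-Limit: %d\r\n"
        "X-RateLimit-Remaining: 0\r\n"
        "X-RateLimit-Reset: %ld\r\n"
        "Content-Type: application/json\r\n"
        "\r\n"
        "{\"error\":\"RATE_LIMIT_EXCEEDED\",\"retry_after_seconds\":%d}\n",
        retry_after, limit, reset_unix, retry_after);
}
```

The handler calls this whenever the bucket check fails, so every rejected client gets machine-readable back-off guidance.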
── Implementation 1 — Prometheus Metrics Exposition Endpoint ──
📡 Prometheus /metrics Endpoint in C
/* metrics.h — thread-safe counters and histograms for Prometheus exposition */
#pragma once
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* HTTP request counter: label dimensions = {method, status} */
typedef struct {
    _Atomic(long) get_2xx, get_4xx, get_5xx;
    _Atomic(long) post_2xx, post_4xx, post_5xx;
} http_counters_t;

/* Latency histogram: fixed buckets in seconds */
#define BUCKET_COUNT 7
static const double BUCKETS[BUCKET_COUNT] = { 0.005, 0.01, 0.025, 0.05, 0.1, 0.5, 1.0 };

typedef struct {
    _Atomic(long) bucket[BUCKET_COUNT];
    _Atomic(long) count;
    _Atomic(long) sum_us;  /* sum in microseconds: atomic double isn't portable */
} latency_histogram_t;

/* Global metrics state */
static http_counters_t g_http = {0};
static latency_histogram_t g_lat = {0};
static _Atomic(long) g_active_conn = 0;

static inline void record_request(const char *method, int status, double dur_s) {
    /* Increment counter by method+status (POST handled analogously) */
    if (strcmp(method, "GET") == 0) {
        if (status < 300)      atomic_fetch_add(&g_http.get_2xx, 1);
        else if (status < 500) atomic_fetch_add(&g_http.get_4xx, 1);
        else                   atomic_fetch_add(&g_http.get_5xx, 1);
    }
    /* Update histogram: buckets are cumulative — bump every bucket >= value */
    for (int i = 0; i < BUCKET_COUNT; i++)
        if (dur_s <= BUCKETS[i]) atomic_fetch_add(&g_lat.bucket[i], 1);
    atomic_fetch_add(&g_lat.count, 1);
    atomic_fetch_add(&g_lat.sum_us, (long)(dur_s * 1e6));
}

/* Render /metrics response body into buf */
static inline int render_metrics(char *buf, size_t sz) {
    int n = 0;
    n += snprintf(buf + n, sz - n,
        "# HELP http_requests_total Total HTTP requests\n"
        "# TYPE http_requests_total counter\n"
        "http_requests_total{method=\"GET\",status=\"2xx\"} %ld\n"
        "http_requests_total{method=\"GET\",status=\"4xx\"} %ld\n"
        "http_requests_total{method=\"GET\",status=\"5xx\"} %ld\n",
        atomic_load(&g_http.get_2xx), atomic_load(&g_http.get_4xx),
        atomic_load(&g_http.get_5xx));
    n += snprintf(buf + n, sz - n,
        "# HELP active_connections Current connections\n"
        "# TYPE active_connections gauge\n"
        "active_connections %ld\n", atomic_load(&g_active_conn));
    n += snprintf(buf + n, sz - n,
        "# HELP request_duration_seconds Latency histogram\n"
        "# TYPE request_duration_seconds histogram\n");
    for (int i = 0; i < BUCKET_COUNT; i++)
        n += snprintf(buf + n, sz - n,
            "request_duration_seconds_bucket{le=\"%.3f\"} %ld\n",
            BUCKETS[i], atomic_load(&g_lat.bucket[i]));
    n += snprintf(buf + n, sz - n,
        "request_duration_seconds_bucket{le=\"+Inf\"} %ld\n"
        "request_duration_seconds_sum %.6f\n"
        "request_duration_seconds_count %ld\n",
        atomic_load(&g_lat.count),
        atomic_load(&g_lat.sum_us) / 1e6,
        atomic_load(&g_lat.count));
    return n;
}
── Implementation 2 — Structured JSON Logger ──
📝 Structured JSON Logger with Trace ID
/* logger.h — structured JSON logger with trace context */
#pragma once
#include <stdio.h>
#include <string.h>
#include <time.h>

typedef struct {
    char trace_id[33]; /* 128-bit hex */
    char span_id[17];  /* 64-bit hex */
} trace_ctx_t;

/* Thread-local trace context */
static _Thread_local trace_ctx_t tl_trace = {0};

static inline void log_set_trace(const char *trace_id, const char *span_id) {
    strncpy(tl_trace.trace_id, trace_id, 32);
    strncpy(tl_trace.span_id, span_id, 16);
    tl_trace.trace_id[32] = tl_trace.span_id[16] = '\0';
}

static inline const char *get_iso8601(char *buf, size_t n) {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    struct tm *tm = gmtime(&ts.tv_sec);
    size_t len = strftime(buf, n, "%Y-%m-%dT%H:%M:%S", tm);
    snprintf(buf + len, n - len, ".%03ldZ", ts.tv_nsec / 1000000);
    return buf;
}

/* JSON-escape a string (handles quotes and backslashes) */
static inline void json_escape(char *dst, size_t dsz, const char *src) {
    size_t d = 0;
    for (size_t s = 0; src[s] && d + 2 < dsz; s++) {
        if (src[s] == '"' || src[s] == '\\') dst[d++] = '\\';
        dst[d++] = src[s];
    }
    dst[d] = '\0';
}

#define LOG(level, msg_fmt, ...) do { \
    char _ts[32], _msg[512], _esc[512]; \
    get_iso8601(_ts, sizeof(_ts)); \
    snprintf(_msg, sizeof(_msg), msg_fmt, ##__VA_ARGS__); \
    json_escape(_esc, sizeof(_esc), _msg); \
    fprintf(stdout, \
        "{\"ts\":\"%s\",\"level\":\"%s\",\"service\":\"order-svc\"," \
        "\"trace_id\":\"%s\",\"span_id\":\"%s\",\"msg\":\"%s\"}\n", \
        _ts, level, tl_trace.trace_id, tl_trace.span_id, _esc); \
} while (0)

#define LOG_INFO(fmt, ...)  LOG("INFO",  fmt, ##__VA_ARGS__)
#define LOG_WARN(fmt, ...)  LOG("WARN",  fmt, ##__VA_ARGS__)
#define LOG_ERROR(fmt, ...) LOG("ERROR", fmt, ##__VA_ARGS__)

/* Usage:
   log_set_trace("4bf92f3577b34da6a3ce929d", "00f067aa0ba902b7");
   LOG_INFO("order placed order_id=%s amount=%.2f", order_id, amount); */
── Implementation 3 — Token Bucket Rate Limiter ──
🪣 Token Bucket Rate Limiter (Thread-Safe, C11 Atomics)
/* token_bucket.h — thread-safe token bucket rate limiter */
#pragma once
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

typedef struct {
    _Atomic(long) tokens_us;      /* tokens * 1e6 (avoid float atomics) */
    _Atomic(long) last_refill_us; /* last refill time in microseconds */
    long capacity_us;             /* max tokens * 1e6 */
    long rate;                    /* tokens/sec == micro-tokens per microsecond */
} token_bucket_t;

static inline long now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

static inline void tb_init(token_bucket_t *tb, double rate_per_sec, double capacity) {
    tb->capacity_us = (long)(capacity * 1e6);
    tb->rate = (long)rate_per_sec;
    atomic_store(&tb->tokens_us, tb->capacity_us);
    atomic_store(&tb->last_refill_us, now_us());
}

/* Returns true if request is allowed; false if rate-limited */
static inline bool tb_allow(token_bucket_t *tb) {
    long now = now_us();
    long last = atomic_exchange(&tb->last_refill_us, now);
    long elapsed_us = now - last;
    if (elapsed_us > 0) {
        /* micro-tokens accrued = rate (tokens/sec) * elapsed (us): exact,
           since 1 token/sec equals 1 micro-token/us — no fractional loss */
        long add = tb->rate * elapsed_us;
        long cur = atomic_fetch_add(&tb->tokens_us, add) + add;
        if (cur > tb->capacity_us)
            atomic_store(&tb->tokens_us, tb->capacity_us); /* clamp; benign race */
    }
    /* Try to consume one token (1e6 micro-tokens) */
    long one_token = 1000000LL;
    long prev = atomic_fetch_sub(&tb->tokens_us, one_token);
    if (prev >= one_token) return true;           /* allowed */
    atomic_fetch_add(&tb->tokens_us, one_token);  /* not enough: restore */
    return false;                                 /* rate limited */
}

/* Usage:
   token_bucket_t per_client_bucket;
   tb_init(&per_client_bucket, 100.0, 200.0);  // 100 req/sec, burst = 200
   if (!tb_allow(&per_client_bucket)) { respond_429(); return; }  */
🔬 Lab 1 — Prometheus Metrics + Grafana Dashboard
Instrument a C service and build a RED dashboard in Grafana.
1 Add the metrics module from Tab 8 to the health check HTTP server (M15 Tab 8). Expose /metrics on port 8081 alongside /health/live.
2 Run Prometheus locally: docker run -p 9090:9090 -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus. Configure prometheus.yml to scrape localhost:8081/metrics every 15s.
3 Generate load with wrk -t4 -c100 -d30s http://localhost:8080/orders. Watch metrics accumulate at http://localhost:9090.
4 Run Grafana: docker run -p 3000:3000 grafana/grafana. Add Prometheus as data source. Build a RED dashboard with three panels: request rate, error rate %, p99 latency.
5 Add a Prometheus alerting rule: fire if error rate > 1% for 2 minutes. Simulate errors by making your handler return 500 randomly. Watch the alert move from PENDING to FIRING.
🔬 Lab 2 — Distributed Trace Propagation
Implement W3C traceparent propagation across two C services.
1 Build Service A (port 8080) and Service B (port 8081). Both run the structured logger from Tab 8.
2 Service A: on each request, generate a trace_id (UUID) + span_id. If the request already has traceparent, extract the trace_id and create a child span_id.
3 Service A calls Service B via HTTP, forwarding the traceparent header. Service B logs with the same trace_id from the header.
4 Send a request to Service A. In the combined log output, grep for the trace_id — verify both service logs appear with the same trace_id, showing the full request chain.
5 Bonus: run Jaeger locally (docker run -p 16686:16686 -p 4317:4317 jaegertracing/all-in-one). Use the OTel C++ SDK (callable from C) to export spans to Jaeger over OTLP and view the trace waterfall.
🔬 Lab 3 — Token Bucket Rate Limiter Under Load
Verify rate limiter correctness under concurrent load.
1 Integrate the token bucket from Tab 8 into the HTTP server. Return 429 Too Many Requests with Retry-After header when rate-limited.
2 Configure: 100 req/sec rate, 150 token burst capacity.
3 Burst test: send 200 requests simultaneously. Verify roughly 150 succeed and roughly 50 receive 429. Check Prometheus counter: http_requests_total{status="4xx"}.
4 Sustained test: wrk -t8 -c100 -d60s at 1000 req/sec (10× limit). Verify roughly 100 req/sec succeed and the rest 429. The rate should be stable over the 60s window.
5 Concurrency test: 8 threads all decrementing the same bucket simultaneously for 10 seconds. Verify no race conditions using TSan: clang -fsanitize=thread.
🔬 Lab 4 — Security: SQL Injection & SSRF Prevention
Demonstrate vulnerability and fix in C using libpq.
1 Write a vulnerable handler: char query[256]; snprintf(query, sizeof(query), "SELECT * FROM orders WHERE id='%s'", user_input); PQexec(conn, query);. Try input: '; DROP TABLE orders; --. Verify it executes.
2 Fix it: use PQexecParams(conn, "SELECT * FROM orders WHERE id=$1", 1, NULL, params, NULL, NULL, 0). Retry the injection — verify it returns no results (treats the entire input as a literal string).
3 Write the SSRF-vulnerable handler: accept a URL from request body and fetch it with libcurl. Try the AWS metadata endpoint: http://169.254.169.254/latest/meta-data/. Verify it returns data.
4 Fix it: integrate safe_fetch_url() from Tab 6. Verify the metadata URL is blocked. Verify a legitimate allowlisted URL succeeds.
── Phase 7 Mastery Checklist ──
Observability
  • Explain the 3 pillars and what question each answers
  • Write a structured JSON log line with all mandatory fields
  • List what must never appear in logs (secrets, PII)
  • Explain how trace_id links logs across services
Metrics
  • Distinguish counter, gauge, histogram, summary
  • Write PromQL for request rate, error rate %, and p99 latency
  • Apply RED method to a service and USE method to a resource
  • Explain why high-cardinality labels are dangerous in Prometheus
  • Write a Prometheus alerting rule with for duration
Tracing
  • Explain trace, span, parent_span_id relationship
  • Parse and construct a W3C traceparent header
  • Choose the right sampling strategy for given traffic/budget
Alerting & SLO
  • Define SLI, SLO, SLA, Error Budget
  • Write a burn rate alert (faster than threshold-crossing alerts)
  • Explain why alerting on symptoms is better than causes
Security
  • Fix SQL injection with parameterized queries in libpq
  • Block SSRF with allowlist + RFC-1918 IP check
  • Explain broken access control with a concrete example
  • Describe dynamic secrets (Vault) vs static env vars
  • Write allowlist input validation for a struct field
Rate Limiting
  • Implement token bucket: capacity, rate, burst
  • Compare sliding window counter vs fixed window (boundary problem)
  • Return correct 429 + Retry-After + X-RateLimit-* headers