M17 — Observability & Hardening
Phase 7
3 pillars: logs · metrics · traces · Prometheus & PromQL · Distributed tracing & OpenTelemetry · Rate limiting · OWASP Top 10 · Secrets management · Graceful shutdown
🔭 The 3 Pillars of Observability
Observability is the ability to understand the internal state of a system from its external outputs. The three pillars each answer a different question:
| Pillar | Question Answered | Data Type | Tools |
|---|---|---|---|
| Logs | "What happened, exactly?" | Discrete events with context | ELK, Loki, Fluentd |
| Metrics | "How fast / how many / how full?" | Aggregated numeric time-series | Prometheus, Grafana, Datadog |
| Traces | "Why was this request slow?" | Causal chains across services | Jaeger, Tempo, Zipkin, Honeycomb |
The three pillars are complementary, not interchangeable. An alert fires on a metric (high p99 latency). You look at a trace to find the slow span. You look at logs from that span to see the exact error. Use all three together.
Analogy — The flight data recorder:
Logs are the cockpit voice recorder — full narrative of what was said. Metrics are the flight data recorder — altitude, speed, attitude plotted over time. Traces are the air traffic control replay — the full path of the aircraft from departure to destination. An accident investigation uses all three.
📐 Observability vs Monitoring
| | Monitoring | Observability |
|---|---|---|
| Approach | Predefined thresholds and dashboards for known failure modes | Ability to ask arbitrary questions about system behavior |
| Limits | Only catches failures you anticipated and built alerts for | Enables debugging novel, unknown failure modes |
| Data | Aggregated metrics, simple health checks | Logs + metrics + traces with high cardinality |
| Tooling | Nagios, simple dashboards | OpenTelemetry, Honeycomb, Grafana + Loki + Tempo |
Start with monitoring (dashboards for known metrics, alerts on thresholds). Add observability as system complexity grows — when you start debugging failures you didn't anticipate.
📐 Phase 7 Module Map
| Module | Topic | Key Concepts |
|---|---|---|
| M17 (this) | Observability & Hardening | Logs, metrics, traces, alerting, SLO, security, rate limiting |
| M18 | Performance Engineering | Profiling, flame graphs, memory analysis, benchmark methodology |
Prerequisites: Phase 6 (Microservices — you need services to observe; the health probes from M15 are the basis of the readiness checks here)
📝 Structured Logging: JSON Lines Format
Write one JSON object per line to stdout. Every log line must include mandatory fields for searchability:
/* Good: structured JSON log */
{"ts":"2026-03-27T14:23:01.442Z","level":"INFO","service":"order-svc",
"trace_id":"4bf92f3577b34da6","span_id":"00f067aa0ba902b7",
"msg":"order placed","order_id":"ord-9821","user_id":"u-44","amount_usd":49.99}
/* Bad: unstructured text — unsearchable */
[2026-03-27 14:23:01] INFO: Order ord-9821 placed by user u-44 for $49.99
📋 Mandatory Log Fields
| Field | Format | Purpose |
|---|---|---|
| ts | ISO-8601 with ms | Timeline reconstruction |
| level | DEBUG/INFO/WARN/ERROR/FATAL | Log level filtering |
| service | service name | Multi-service log aggregation |
| trace_id | hex string | Correlate with traces |
| span_id | hex string | Correlate with specific span |
| msg | human-readable | Event description |
🎚️ Log Levels — When to Use Each
| Level | Use When |
|---|---|
| DEBUG | Verbose detail for local dev only. Never in production — log volume explosion. |
| INFO | Normal business events: request received, order placed, job started. |
| WARN | Degraded but recoverable: retry succeeded, cache miss, approaching limit. |
| ERROR | Unexpected failure requiring attention: DB timeout, invalid state, downstream error. |
| FATAL | Unrecoverable — process will exit after logging. |
🚫 What NOT to Log
| Never Log | Why | Alternative |
|---|---|---|
| Passwords, API keys, tokens | Log aggregators, retention, and breach exposure | Log presence/absence, not value |
| Full credit card numbers | PCI-DSS violation | Log last 4 digits only |
| PII (emails, SSN, full name) | GDPR/CCPA violation | Log user_id (opaque reference) |
| Full request/response bodies | Volume, PII risk | Log status codes and latency only |
| Health probe hits (/health/*) | Thousands/min of noise | Filter at log aggregator |
Log injection: never embed user-supplied strings directly in log messages without sanitization. A user who sets their name to ","level":"ERROR","msg":"admin escalation can forge log entries. Escape or use parameterized logging.
🔗 Correlation: Linking Logs Across Services
The trace_id links all log lines for a single request across every service it touches:
/* API Gateway generates trace_id */
GET /orders/123 → trace_id: 4bf92f3577b34da6
/* Order Service log */
{"service":"order-svc","trace_id":"4bf92f3577b34da6","msg":"fetching order"}
/* Database call log */
{"service":"order-svc","trace_id":"4bf92f3577b34da6","msg":"db query","duration_ms":12}
/* In Loki/Elasticsearch: search by trace_id to see full request timeline */
{trace_id="4bf92f3577b34da6"} | json | sort by ts
In Grafana with Loki + Tempo integration: click a trace in Tempo → jump directly to correlated logs in Loki for that trace_id. This cross-pillar navigation is the power of consistent trace_id propagation.
📊 Metric Types
| Type | Properties | Example | PromQL Usage |
|---|---|---|---|
| Counter | Monotonically increasing, never decreases, resets to 0 on restart | http_requests_total{method="GET",status="200"} | rate(http_requests_total[5m]) → requests/sec |
| Gauge | Point-in-time value, can go up or down | active_connections, memory_usage_bytes, queue_depth | Direct: active_connections > 1000 |
| Histogram | Samples bucketed by value; provides _count, _sum, _bucket | request_duration_seconds{le="0.1"} | histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m])) |
| Summary | Pre-computed quantiles on client side (less flexible) | request_duration_seconds{quantile="0.99"} | Direct quantile access; can't re-aggregate across instances |
Prefer Histogram over Summary. Histograms can be aggregated across multiple instances (e.g., p99 across all pods). Summaries compute quantiles per-process and can't be aggregated.
🔴 RED Method (Services)
The minimal set of metrics for any service:
- Rate: requests per second — is traffic normal? rate(http_requests_total[5m])
- Errors: error rate — are users experiencing failures? rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
- Duration: latency percentiles — is it slow? histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))
Alert on all three. High error rate → immediate page. High p99 → investigate. Low rate → traffic drop (upstream issue or deploy broke routing).
📊 USE Method (Resources)
For infrastructure resources (CPU, memory, disk, network):
- Utilization: % of time the resource is busy — rate(process_cpu_seconds_total[1m]) * 100
- Saturation: extra work queued (resource can't keep up) — node_load1 / count(node_cpu_seconds_total{mode="idle"}) by (instance)
- Errors: error count or rate for the resource — node_disk_io_time_weighted_seconds_total
High utilization alone isn't a problem. High utilization + high saturation = at capacity. Alert when saturation exceeds 0 (work is queued).
📡 Prometheus: Pull-Based Scraping & Exposition Format
Prometheus polls your service's /metrics endpoint on a configurable interval (typically 15s). Your service exposes metrics in the Prometheus text format:
# HELP http_requests_total Total HTTP requests by method and status code
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 14823
http_requests_total{method="POST",status="201"} 3291
http_requests_total{method="GET",status="500"} 42
# HELP request_duration_seconds Request latency histogram
# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{le="0.01"} 8901
request_duration_seconds_bucket{le="0.05"} 13102
request_duration_seconds_bucket{le="0.1"} 14651
request_duration_seconds_bucket{le="0.5"} 14820
request_duration_seconds_bucket{le="+Inf"} 14823
request_duration_seconds_sum 891.23
request_duration_seconds_count 14823
# HELP active_connections Current active connections
# TYPE active_connections gauge
active_connections 47
Labels are a high-cardinality risk. A label like {user_id="..."} creates one time-series per user — millions of time-series destroy Prometheus. Use low-cardinality labels: method, status, endpoint (grouped). Never use user IDs, trace IDs, or UUIDs as labels.
🔍 Essential PromQL Queries
| What You Want | PromQL Query |
|---|---|
| Request rate (req/sec) | rate(http_requests_total[5m]) |
| Error rate % | rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 |
| p99 latency | histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket[5m])) by (le)) |
| CPU usage % | 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) |
| Memory used | node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes |
| Kafka consumer lag | kafka_consumer_group_lag{topic="orders",partition="0"} |
| DB connection pool saturation | pg_stat_activity_count / pg_settings_max_connections |
🔗 Distributed Tracing: Trace & Span Model
A trace represents the complete journey of a single request through the system — from the client's HTTP request through every service, database call, and queue publish it touches.
A span represents a single unit of work within a trace (one service call, one DB query). Each span has:
- trace_id — shared across all spans in the same request
- span_id — unique to this operation
- parent_span_id — the span that triggered this one (null for the root span)
- Start time + duration
- Attributes (key-value context)
- Status (OK / ERROR)
Trace: trace_id=4bf92f3577b34da6
span_id=00f067aa [API Gateway] GET /orders/123 0ms ──────────────── 87ms
│
├─ span_id=a3ce929d [Order Service] handle_request 2ms ──────── 83ms
│ │
│ ├─ span_id=5e0c22e9 [DB: SELECT orders] 4ms ── 16ms (12ms)
│ │
│ ├─ span_id=7f3d8a1b [Redis: GET cache] 18ms ─ 19ms (1ms)
│ │
│ └─ span_id=2d4f991c [Kafka: publish event] 20ms ──── 34ms (14ms)
│
└─ span_id=b1c5e072 [Auth Service] verify_token 1ms ─ 2ms (1ms)
Flame chart: wider = longer. The DB SELECT at 12ms is the hot spot.
📡 W3C traceparent Header
The standard header for propagating trace context across HTTP service calls:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                │
             │  └─ trace_id (128-bit, 32 hex)    │                └─ flags: 01 = sampled
             └─ version (00)                     └─ parent_span_id          00 = not sampled
                                                    (64-bit, 16 hex)
Each service: read traceparent from incoming request, create a child span (using the span_id as parent_span_id), set the new span's span_id, and propagate the updated traceparent in outgoing calls.
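A minimal sketch of that propagation step in C, assuming version 00 and well-formed lowercase input. make_span_id() is a hypothetical helper for generating a random 64-bit hex span id, and log_set_trace() is the hook from the structured logger in Implementation 2:
/* Parse incoming traceparent, start a child span, build the outgoing header */
#include <stdio.h>
#include <string.h>
void make_span_id(char hex[17]); /* hypothetical: fills 16 hex chars + NUL */
int propagate_traceparent(const char *incoming, char *outgoing, size_t out_sz) {
    char trace_id[33], parent_span[17], child_span[17];
    /* layout: 00-<32 hex trace_id>-<16 hex span_id>-<2 hex flags> */
    if (sscanf(incoming, "00-%32[0-9a-f]-%16[0-9a-f]",
               trace_id, parent_span) != 2)
        return -1; /* malformed or unsupported version */
    make_span_id(child_span);            /* this service's new span_id */
    log_set_trace(trace_id, child_span); /* from logger.h (Implementation 2) */
    /* same trace_id; our span becomes the parent for downstream calls */
    snprintf(outgoing, out_sz, "00-%s-%s-01", trace_id, child_span);
    return 0;
}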
📦 OpenTelemetry (OTel)
OpenTelemetry is the CNCF standard for instrumentation — vendor-neutral SDK for generating traces, metrics, and logs.
- SDK: available in C++, Go, Java, Python, and more
- OTLP: OpenTelemetry Protocol — exports to any backend
- OTel Collector: receives OTLP, processes (batch, sample), exports to Jaeger/Tempo/Datadog
- Auto-instrumentation: inject tracing without changing application code (Java agent, eBPF)
- Manual instrumentation: create custom spans for business logic
📊 Sampling Strategies
At high volume (10,000 req/sec), storing every trace is expensive. Sampling decides which traces to keep:
| Strategy | How | Trade-off |
|---|---|---|
| Head-based: Always-on | Keep 100% of traces | Very expensive at scale |
| Head-based: Probability | Keep N% (e.g. 1%) | Simple, but misses rare errors |
| Head-based: Rate-limit | Keep up to N traces/sec | Bounded cost; may drop bursts |
| Tail-based: Error sampling | Buffer all traces; keep if trace has an error span | Catches errors; high memory buffer |
| Tail-based: Latency threshold | Keep if trace duration > P99 threshold | Catches slowness; complex to implement |
Production recommendation: 1% head-based sampling for normal traffic + 100% sampling for traces with errors (tail-based error sampling). This keeps costs bounded while capturing all failure evidence.
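As a sketch of the head-based part, the keep/drop decision can be derived from the trace_id itself, so every service samples the same traces without coordination (the 1% ratio and 64-bit truncation are illustrative):
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
/* Head-based probability sampling keyed on trace_id: deterministic,
   so all services in the request path agree on keep vs drop. */
static bool should_sample(const char *trace_id_hex /* 32 hex chars */) {
    uint64_t low = strtoull(trace_id_hex + 16, NULL, 16); /* last 64 bits */
    return low % 100 == 0; /* keep ~1% of traces */
}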
🔔 Alerting with Prometheus AlertManager
Prometheus evaluates alerting rules against scraped metrics. When conditions are met, it sends alerts to AlertManager, which routes them to PagerDuty, Slack, email, etc.
# alerting_rules.yml
groups:
- name: order-service
rules:
- alert: HighErrorRate
expr: |
rate(http_requests_total{service="order-svc",status=~"5.."}[5m])
/ rate(http_requests_total{service="order-svc"}[5m]) > 0.01
for: 2m # must be true for 2m before firing (avoid flapping)
labels:
severity: critical
annotations:
summary: "Error rate > 1% for order-service"
runbook_url: "https://wiki/runbooks/order-svc-errors"
- alert: HighP99Latency
expr: |
histogram_quantile(0.99,
sum(rate(request_duration_seconds_bucket{service="order-svc"}[5m]))
by (le)) > 2.0
for: 5m
labels:
severity: warning
- alert: ServiceDown
expr: up{service="order-svc"} == 0
for: 1m
labels:
severity: critical
📏 SLI, SLO, SLA
- SLI (Service Level Indicator): a specific measurable metric — e.g., availability = (successful requests) / (total requests)
- SLO (Service Level Objective): internal target — e.g., availability ≥ 99.9% over 30 days. Engineering commits to this.
- SLA (Service Level Agreement): external contract with customers, carrying legal/financial penalties for violations. The internal SLO should be tighter than the SLA as a safety buffer.
- Error Budget: the SLO headroom — a 99.9% SLO leaves a 0.1% budget ≈ 43.2 min per 30 days. Track the burn rate.
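The budget arithmetic, as a quick check: 30 days × 24 h × 60 min = 43,200 min; (1 − 0.999) × 43,200 min = 43.2 min of allowed downtime per 30-day window (or the equivalent fraction of failed requests).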
💸 Error Budget & Burn Rate Alerts
Don't alert "SLO violated" (too late). Alert on burn rate — how fast you're consuming the error budget:
- Fast burn: consuming the budget at 14× the sustainable rate (measured over a 1h window) → the 30-day budget is gone in ~2 days → page immediately
- Slow burn: consuming at 3× the sustainable rate → gone in ~10 days → ticket
# Fast burn alert: >14x budget consumption over 1h
expr: |
(1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
/ sum(rate(http_requests_total[1h]))))
/ (1 - 0.999) > 14
📋 Alert Design Principles
- Alert on symptoms, not causes: alert on high error rate (a symptom users feel), not on "CPU at 80%" (a cause that may not impact users)
- Every alert needs a runbook: include a runbook_url annotation. On-call engineers should never face an alert without documented response steps.
- Avoid alert fatigue: if the same alert fires weekly and engineers silence it, it's not actionable. Remove or fix it.
- Use a for duration: require the condition to be sustained before paging (avoids flapping on 1-second spikes)
- Group related alerts: AlertManager can group 50 firing alerts into one notification — prevents notification flood during outages
🔒 OWASP Top 10 for Backend Services
| Vulnerability | C/Backend Example | Prevention |
|---|---|---|
| SQL Injection | "SELECT * FROM users WHERE id=" + user_id | Parameterized queries only: PQexecParams(conn, "SELECT... WHERE id=$1", 1, NULL, params, ...) |
| Command Injection | system("ls " + user_input) | Never use system() with user input. Use execv() with argument array. |
| SSRF | Service fetches URL from user request body; attacker uses http://169.254.169.254/ (AWS metadata) | Allowlist of permitted outbound domains; block RFC-1918 and link-local addresses |
| Broken Access Control | User A can read User B's orders by changing order_id in request | Check authorization on every resource: WHERE id=$1 AND user_id=$2 |
| Security Misconfiguration | Debug endpoints enabled in prod, default credentials, verbose error messages | Separate prod config; disable /debug endpoints; return generic errors |
| Insecure Deserialization | Deserializing untrusted binary input (msgpack, protobuf from user) | Validate schema; set max sizes; reject unknown fields |
| Cryptographic Failures | MD5 for passwords, ECB mode, hardcoded keys | bcrypt/Argon2 for passwords; AES-256-GCM for encryption; libsodium |
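A minimal libpq sketch of the parameterized-query fix from the table above (connection setup and error handling omitted; the table and column names are illustrative):
#include <libpq-fe.h>
/* user_input is bound as a parameter, never spliced into the SQL text,
   so an input like '; DROP TABLE orders; -- is just a string value */
PGresult *fetch_order(PGconn *conn, const char *user_input) {
    const char *params[1] = { user_input };
    return PQexecParams(conn,
                        "SELECT * FROM orders WHERE id = $1",
                        1,      /* nParams */
                        NULL,   /* paramTypes: let the server infer */
                        params, /* paramValues */
                        NULL,   /* paramLengths: NULL ok for text values */
                        NULL,   /* paramFormats: all text */
                        0);     /* resultFormat: text */
}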
🛡️ SSRF Prevention in C
/* SSRF allowlist check before making outbound HTTP request */
#include <netdb.h>
#include <arpa/inet.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>
/* Returns 1 if IP is RFC-1918 / link-local / loopback (block these) */
static int is_private_ip(const char *ip) {
struct in_addr addr;
if (!inet_pton(AF_INET, ip, &addr)) return 0;
uint32_t n = ntohl(addr.s_addr);
return
(n >> 24 == 10) /* 10.0.0.0/8 */
|| (n >> 20 == (172 << 4) + 1) /* 172.16.0.0/12 */
|| (n >> 16 == (192 << 8) + 168) /* 192.168.0.0/16 */
|| (n >> 24 == 127) /* 127.0.0.0/8 */
|| (n >> 16 == (169 << 8) + 254); /* 169.254.0.0/16 */
}
int safe_fetch_url(const char *url) {
/* 1. Parse hostname from URL (simplified) */
char hostname[256] = {0};
if (sscanf(url, "https://%255[^/]", hostname) != 1) return -1;
/* 2. Allowlist check: only permitted domains */
const char *allowed[] = { "api.stripe.com", "hooks.slack.com", NULL };
int permitted = 0;
for (int i = 0; allowed[i]; i++)
if (strcmp(hostname, allowed[i]) == 0) { permitted = 1; break; }
if (!permitted) {
fprintf(stderr, "SSRF blocked: %s not in allowlist\n", hostname);
return -1;
}
/* 3. DNS resolution + IP check */
struct addrinfo hints = {0}, *res;
hints.ai_family = AF_INET; /* simplified: IPv4 only */
if (getaddrinfo(hostname, NULL, &hints, &res) != 0 || res == NULL) {
fprintf(stderr, "SSRF blocked: DNS resolution failed for %s\n", hostname);
return -1;
}
char ip[INET6_ADDRSTRLEN];
inet_ntop(AF_INET,
&((struct sockaddr_in *)res->ai_addr)->sin_addr,
ip, sizeof(ip));
freeaddrinfo(res);
if (is_private_ip(ip)) {
fprintf(stderr, "SSRF blocked: %s resolved to private IP %s\n",
hostname, ip);
return -1;
}
/* 4. Make the actual HTTP request */
return 0; /* proceed */
}
🔑 Secrets Management
| Approach | Security | Details |
|---|---|---|
| Hardcoded in source | ❌ Never | Committed to git, all developers see it, forever in history |
| Environment variables | ⚠️ Acceptable | Not in code but visible in process env, logs, crash dumps — use only with Kubernetes Secrets |
| Kubernetes Secrets | ✅ Good | Base64 in etcd (encrypt etcd at rest); mounted as files or env; access controlled by RBAC |
| HashiCorp Vault | ✅✅ Best | Dynamic secrets (generated on request, auto-expire), audit log, lease renewal, fine-grained access control |
| AWS Secrets Manager / GCP Secret Manager | ✅✅ Best | Managed service equivalent; auto-rotation; IAM-controlled access |
Dynamic secrets (Vault): instead of a long-lived DB password, Vault generates a unique username+password for each service instance with a 1-hour TTL. When the instance dies, the credential expires automatically. Breach impact is bounded in time and scope.
Secret rotation: rotate all secrets after any suspected breach. Never reuse credentials. Implement graceful rotation: support old and new secret simultaneously for 30s during rotation to avoid downtime.
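A sketch of the dual-secret acceptance window for inbound API keys (the parameter names are illustrative; a production check would also use a constant-time comparison rather than strcmp, which leaks timing):
#include <stdbool.h>
#include <string.h>
/* During rotation, accept both secrets so callers still holding the old
   credential keep working until the window closes. */
static bool api_key_valid(const char *presented,
                          const char *current_secret,
                          const char *previous_secret /* NULL outside window */) {
    if (strcmp(presented, current_secret) == 0) return true;
    return previous_secret != NULL && strcmp(presented, previous_secret) == 0;
}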
✅ Input Validation: Trust Boundaries
Validate all data at trust boundaries (external inputs). Trust internal calls and framework output.
/* Input validation at trust boundary (HTTP request body) */
#include <string.h>
#include <ctype.h>
typedef struct {
char order_id[37]; /* UUID: 36 chars + null */
double amount;
int item_count;
} order_request_t;
/* UUID format: 8-4-4-4-12 hex chars, dashes at positions 8, 13, 18, 23 */
static int is_valid_uuid(const char *s) {
for (int i = 0; i < 36; i++) {
int dash = (i == 8 || i == 13 || i == 18 || i == 23);
if (dash ? s[i] != '-' : !isxdigit((unsigned char)s[i])) return 0;
}
return 1;
}
int validate_order_request(const order_request_t *req) {
/* Size bounds */
if (strlen(req->order_id) != 36) return -1;
/* UUID format: 8-4-4-4-12 hex chars with dashes */
if (!is_valid_uuid(req->order_id)) return -1;
/* Business rule bounds */
if (req->amount <= 0.0 || req->amount > 100000.0) return -1;
if (req->item_count < 1 || req->item_count > 100) return -1;
return 0; /* valid */
}
/* Use allowlist validation, not denylist:
know what's valid and reject everything else,
rather than trying to enumerate all invalid inputs */
🚦 Rate Limiting Algorithms
Rate limiting protects services from overload and abusive clients. Implement at two layers:
- API Gateway: global rate limiting per client IP or API key — protects all services
- Per-service: self-defense against gateway bypass or internal traffic spikes
🪣 Token Bucket
A bucket holds up to capacity tokens. Tokens are added at rate r tokens/sec. Each request consumes one token. If the bucket is empty, the request is rejected.
Key property: allows bursts up to capacity while maintaining an average rate of r req/sec.
Refill rate: 10 tokens/sec Capacity: 20 tokens
t=0: [████████████████████] 20 tokens → burst of 20 requests: OK
t=0.01 [░░░░░░░░░░░░░░░░░░░░] ~0 tokens → request: REJECT (429)
t=0.5  [█████░░░░░░░░░░░░░░░] 5 tokens → 5 requests: OK
t=1.0  [█████░░░░░░░░░░░░░░░] 5 tokens → 5 requests: OK
Steady state: 10 req/sec sustained (burst allowed up to 20)
🪟 Sliding Window Counter
Divide time into fixed windows. Track request count in the current window and weight with the previous window's count. More accurate than a fixed window but O(1) memory.
Formula:
count = prev_window_count × overlap_fraction + curr_window_count
Example: with a 60s window, 15s into the current one, overlap = 1 − 15/60 = 0.75; with prev = 80 and curr = 30, count = 80 × 0.75 + 30 = 90.
/* Sliding window counter with Redis (pseudocode) */
long sliding_window_count(const char *client_key, int window_sec) {
long now = time(NULL);
long curr_window = now / window_sec;
long prev_window = curr_window - 1;
double elapsed = now % window_sec;
double overlap = 1.0 - elapsed / window_sec;
long prev_count = redis_get_counter(client_key, prev_window);
long curr_count = redis_incr_counter(client_key, curr_window, window_sec);
return (long)(prev_count * overlap + curr_count);
}
⚖️ Algorithm Comparison
| Algorithm | Burst Handling | Memory | Accuracy | Best For |
|---|---|---|---|---|
| Fixed Window Counter | Double burst at boundary (end+start of adjacent windows) | O(1) | Low (boundary problem) | Simple low-traffic systems |
| Sliding Window Log | Exact | O(requests in window) | Exact | Low-volume, exact limits needed |
| Sliding Window Counter | Approximate (±0.1%) | O(1) | High | Most production APIs |
| Token Bucket | Allows bursts up to capacity | O(1) | High | APIs tolerating short bursts |
| Leaky Bucket | Smooths all bursts, strict output rate | O(1) | High | Traffic shaping (network) |
📤 Rate Limit Response Headers
Return these headers on every response so clients can self-throttle:
HTTP/1.1 200 OK
X-RateLimit-Limit: 100 # max requests per window
X-RateLimit-Remaining: 47 # requests left in current window
X-RateLimit-Reset: 1711544400 # Unix timestamp when window resets
HTTP/1.1 429 Too Many Requests
Retry-After: 23 # seconds until client can retry
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
Content-Type: application/json
{"error": "RATE_LIMIT_EXCEEDED", "retry_after_seconds": 23}
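A sketch of emitting the 429 variant from a C handler — write_all() is a hypothetical socket-send helper, and the limit/reset bookkeeping is assumed to come from the rate limiter:
#include <stdio.h>
void write_all(int fd, const char *buf, int len); /* hypothetical send helper */
void respond_429(int fd, int limit, long reset_unix, int retry_after_s) {
    char body[128], buf[512];
    int blen = snprintf(body, sizeof(body),
        "{\"error\": \"RATE_LIMIT_EXCEEDED\", \"retry_after_seconds\": %d}",
        retry_after_s);
    int n = snprintf(buf, sizeof(buf),
        "HTTP/1.1 429 Too Many Requests\r\n"
        "Retry-After: %d\r\n"
        "X-RateLimit-Limit: %d\r\n"
        "X-RateLimit-Remaining: 0\r\n"
        "X-RateLimit-Reset: %ld\r\n"
        "Content-Type: application/json\r\n"
        "Content-Length: %d\r\n\r\n%s",
        retry_after_s, limit, reset_unix, blen, body);
    write_all(fd, buf, n);
}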
Never silently drop rate-limited requests. Return 429 with Retry-After so well-behaved clients back off correctly. Silently dropping causes clients to retry faster (thundering herd).
── Implementation 1 — Prometheus Metrics Exposition Endpoint ──
📡 Prometheus /metrics Endpoint in C
/* metrics.h — thread-safe counters and histograms for Prometheus exposition */
#pragma once
#include <stdatomic.h>
#include <stdio.h>
#include <string.h> /* strcmp */
#include <time.h>
/* HTTP request counter: label dimensions = {method, status} */
typedef struct {
_Atomic(long) get_2xx, get_4xx, get_5xx;
_Atomic(long) post_2xx, post_4xx, post_5xx;
} http_counters_t;
/* Latency histogram: fixed buckets in seconds */
#define BUCKET_COUNT 7
static const double BUCKETS[BUCKET_COUNT] =
{ 0.005, 0.01, 0.025, 0.05, 0.1, 0.5, 1.0 };
typedef struct {
_Atomic(long) bucket[BUCKET_COUNT];
_Atomic(long) count;
_Atomic(long) sum_us; /* sum in integer microseconds — C11 has no atomic fetch_add for double */
} latency_histogram_t;
/* Global metrics state */
static http_counters_t g_http = {0};
static latency_histogram_t g_lat = {0};
static _Atomic(long) g_active_conn = 0;
static inline void record_request(const char *method, int status, double dur_s) {
/* Increment counter by method+status */
if (strcmp(method, "GET") == 0) {
if (status < 300) atomic_fetch_add(&g_http.get_2xx, 1);
else if (status < 500) atomic_fetch_add(&g_http.get_4xx, 1);
else atomic_fetch_add(&g_http.get_5xx, 1);
} else if (strcmp(method, "POST") == 0) {
if (status < 300) atomic_fetch_add(&g_http.post_2xx, 1);
else if (status < 500) atomic_fetch_add(&g_http.post_4xx, 1);
else atomic_fetch_add(&g_http.post_5xx, 1);
}
/* Update histogram: buckets are cumulative — increment every bucket whose bound covers dur_s */
for (int i = 0; i < BUCKET_COUNT; i++)
if (dur_s <= BUCKETS[i])
atomic_fetch_add(&g_lat.bucket[i], 1);
atomic_fetch_add(&g_lat.count, 1);
/* Track the sum as integer microseconds to stay atomic without a mutex */
atomic_fetch_add(&g_lat.sum_us, (long)(dur_s * 1e6));
}
}
/* Render /metrics response body into buf */
static inline int render_metrics(char *buf, size_t sz) {
int n = 0;
n += snprintf(buf + n, sz - n,
"# HELP http_requests_total Total HTTP requests\n"
"# TYPE http_requests_total counter\n"
"http_requests_total{method=\"GET\",status=\"2xx\"} %ld\n"
"http_requests_total{method=\"GET\",status=\"4xx\"} %ld\n"
"http_requests_total{method=\"GET\",status=\"5xx\"} %ld\n",
atomic_load(&g_http.get_2xx),
atomic_load(&g_http.get_4xx),
atomic_load(&g_http.get_5xx));
n += snprintf(buf + n, sz - n,
"# HELP active_connections Current connections\n"
"# TYPE active_connections gauge\n"
"active_connections %ld\n",
atomic_load(&g_active_conn));
n += snprintf(buf + n, sz - n,
"# HELP request_duration_seconds Latency histogram\n"
"# TYPE request_duration_seconds histogram\n");
for (int i = 0; i < BUCKET_COUNT; i++)
n += snprintf(buf + n, sz - n,
"request_duration_seconds_bucket{le=\"%.3f\"} %ld\n",
BUCKETS[i], atomic_load(&g_lat.bucket[i]));
n += snprintf(buf + n, sz - n,
"request_duration_seconds_bucket{le=\"+Inf\"} %ld\n"
"request_duration_seconds_sum %.6f\n"
"request_duration_seconds_count %ld\n",
atomic_load(&g_lat.count),
atomic_load(&g_lat.sum_us) / 1e6,
atomic_load(&g_lat.count));
return n;
}
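Hypothetical wiring into a GET /metrics handler (send_all() is an assumed socket helper; the buffer size is arbitrary and a single-threaded scrape path is assumed for the static buffer). The version=0.0.4 content type is the Prometheus text exposition format:
/* GET /metrics handler sketch */
void send_all(int fd, const char *buf, int len); /* hypothetical send helper */
void handle_metrics(int client_fd) {
    static char body[16384];
    int blen = render_metrics(body, sizeof(body));
    char hdr[128];
    int hlen = snprintf(hdr, sizeof(hdr),
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/plain; version=0.0.4\r\n"
        "Content-Length: %d\r\n\r\n", blen);
    send_all(client_fd, hdr, hlen);
    send_all(client_fd, body, blen);
}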
── Implementation 2 — Structured JSON Logger ──
📝 Structured JSON Logger with Trace ID
/* logger.h — structured JSON logger with trace context */
#pragma once
#include <stdio.h>
#include <time.h>
#include <string.h>
typedef struct {
char trace_id[33]; /* 128-bit hex */
char span_id[17]; /* 64-bit hex */
} trace_ctx_t;
/* Thread-local trace context */
static _Thread_local trace_ctx_t tl_trace = {0};
static inline void log_set_trace(const char *trace_id, const char *span_id) {
strncpy(tl_trace.trace_id, trace_id, 32);
strncpy(tl_trace.span_id, span_id, 16);
tl_trace.trace_id[32] = tl_trace.span_id[16] = '\0';
}
static inline const char *get_iso8601(char *buf, size_t n) {
struct timespec ts;
clock_gettime(CLOCK_REALTIME, &ts);
struct tm tm;
gmtime_r(&ts.tv_sec, &tm); /* thread-safe variant of gmtime */
int len = strftime(buf, n, "%Y-%m-%dT%H:%M:%S", &tm);
snprintf(buf + len, n - len, ".%03ldZ", ts.tv_nsec / 1000000);
return buf;
}
/* JSON-escape a string (handles quotes and backslashes) */
static inline void json_escape(char *dst, size_t dsz,
const char *src) {
size_t d = 0;
for (size_t s = 0; src[s] && d + 2 < dsz; s++) {
if (src[s] == '"' || src[s] == '\\') dst[d++] = '\\';
dst[d++] = src[s];
}
dst[d] = '\0';
}
#define LOG(level, msg_fmt, ...) do { \
char _ts[32], _msg[512], _esc[512]; \
get_iso8601(_ts, sizeof(_ts)); \
snprintf(_msg, sizeof(_msg), msg_fmt, ##__VA_ARGS__); \
json_escape(_esc, sizeof(_esc), _msg); \
fprintf(stdout, \
"{\"ts\":\"%s\",\"level\":\"%s\",\"service\":\"order-svc\"," \
"\"trace_id\":\"%s\",\"span_id\":\"%s\",\"msg\":\"%s\"}\n", \
_ts, level, tl_trace.trace_id, tl_trace.span_id, _esc); \
} while(0)
#define LOG_INFO(fmt, ...) LOG("INFO", fmt, ##__VA_ARGS__)
#define LOG_WARN(fmt, ...) LOG("WARN", fmt, ##__VA_ARGS__)
#define LOG_ERROR(fmt, ...) LOG("ERROR", fmt, ##__VA_ARGS__)
/* Usage: */
/* log_set_trace("4bf92f3577b34da6a3ce929d", "00f067aa0ba902b7"); */
/* LOG_INFO("order placed order_id=%s amount=%.2f", order_id, amount); */
── Implementation 3 — Token Bucket Rate Limiter ──
🪣 Token Bucket Rate Limiter (Thread-Safe, C11 Atomics)
/* token_bucket.h — thread-safe token bucket rate limiter */
#pragma once
#include <stdatomic.h>
#include <time.h>
#include <stdbool.h>
typedef struct {
_Atomic(long) tokens_us; /* tokens × 1e6 — integer micro-tokens avoid float atomics */
_Atomic(long) last_refill_us; /* last refill time in microseconds (monotonic) */
long capacity_us; /* max tokens × 1e6 */
long rate_per_sec; /* refill rate in whole tokens per second */
} token_bucket_t;
static inline long now_us(void) {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}
static inline void tb_init(token_bucket_t *tb,
double rate_per_sec, double capacity) {
tb->capacity_us = (long)(capacity * 1e6);
tb->rate_per_sec = (long)rate_per_sec;
atomic_store(&tb->tokens_us, tb->capacity_us);
atomic_store(&tb->last_refill_us, now_us());
}
/* Returns true if request is allowed; false if rate-limited */
static inline bool tb_allow(token_bucket_t *tb) {
long now = now_us();
long last = atomic_load(&tb->last_refill_us);
long elapsed_us = now - last;
/* Refill: tokens/sec × elapsed µs = micro-tokens exactly (no fractional loss).
   CAS on last_refill_us so only one thread claims each elapsed interval. */
long add_us = tb->rate_per_sec * elapsed_us;
if (add_us > 0 &&
    atomic_compare_exchange_strong(&tb->last_refill_us, &last, now)) {
long cur = atomic_fetch_add(&tb->tokens_us, add_us) + add_us;
if (cur > tb->capacity_us) /* cap at capacity (benign race: only lowers) */
atomic_store(&tb->tokens_us, tb->capacity_us);
}
/* Try to consume one token (1e6 micro-tokens) */
long prev = atomic_fetch_sub(&tb->tokens_us, 1000000LL);
if (prev >= 1000000LL) return true; /* allowed */
atomic_fetch_add(&tb->tokens_us, 1000000LL); /* not enough: restore */
return false; /* rate limited */
}
/* Usage: */
/* token_bucket_t per_client_bucket; */
/* tb_init(&per_client_bucket, 100.0, 200.0); // 100 req/sec, burst=200 */
/* if (!tb_allow(&per_client_bucket)) { respond_429(); return; } */
🔬 Lab 1 — Prometheus Metrics + Grafana Dashboard
Instrument a C service and build a RED dashboard in Grafana.
1 Add the metrics module from Implementation 1 to the health check HTTP server (M15). Expose /metrics on port 8081 alongside /health/live.
2 Run Prometheus locally: docker run -p 9090:9090 -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus. Configure prometheus.yml to scrape localhost:8081/metrics every 15s.
3 Generate load with wrk -t4 -c100 -d30s http://localhost:8080/orders. Watch metrics accumulate at http://localhost:9090.
4 Run Grafana: docker run -p 3000:3000 grafana/grafana. Add Prometheus as a data source. Build a RED dashboard with three panels: request rate, error rate %, p99 latency.
5 Add a Prometheus alerting rule: fire if error rate > 1% for 2 minutes. Simulate errors by making your handler return 500 randomly. Watch the alert move from PENDING to FIRING.
🔬 Lab 2 — Distributed Trace Propagation
Implement W3C traceparent propagation across two C services.
1 Build Service A (port 8080) and Service B (port 8081). Both run the structured logger from Implementation 2.
2 Service A: on each request, generate a trace_id (UUID) + span_id. If the request already has traceparent, extract the trace_id and create a child span_id.
3 Service A calls Service B via HTTP, forwarding the traceparent header. Service B logs with the same trace_id from the header.
4 Send a request to Service A. In the combined log output, grep for the trace_id — verify both services' logs appear with the same trace_id, showing the full request chain.
5 Bonus: run Jaeger locally (docker run -p 16686:16686 jaegertracing/all-in-one). Use the OTel C++ SDK to export spans to Jaeger and view the trace waterfall.
🔬 Lab 3 — Token Bucket Rate Limiter Under Load
Verify rate limiter correctness under concurrent load.
1 Integrate the token bucket from Implementation 3 into the HTTP server. Return 429 Too Many Requests with a Retry-After header when rate-limited.
2 Configure: 100 req/sec rate, 150 token burst capacity.
3 Burst test: send 200 requests simultaneously. Verify ~150 succeed and ~50 receive 429. Check the Prometheus counter: http_requests_total{status="4xx"}.
4 Sustained test: wrk -t8 -c100 -d60s at 1000 req/sec (10× the limit). Verify roughly 100 req/sec succeed and the rest receive 429. The success rate should be stable over the 60s window.
5 Concurrency test: 8 threads all decrementing the same bucket simultaneously for 10 seconds. Verify no race conditions with TSan: clang -fsanitize=thread.
🔬 Lab 4 — Security: SQL Injection & SSRF Prevention
Demonstrate vulnerability and fix in C using libpq.
1 Write a vulnerable handler: char query[256]; snprintf(query, sizeof(query), "SELECT * FROM orders WHERE id='%s'", user_input); PQexec(conn, query);. Try input: '; DROP TABLE orders; --. Verify it executes.
2 Fix it: use PQexecParams(conn, "SELECT * FROM orders WHERE id=$1", 1, NULL, params, NULL, NULL, 0). Retry the injection — verify it returns no results (the entire input is treated as a literal string).
3 Write an SSRF-vulnerable handler: accept a URL from the request body and fetch it with libcurl. Try the AWS metadata endpoint: http://169.254.169.254/latest/meta-data/. Verify it returns data.
4 Fix it: integrate safe_fetch_url() from the SSRF prevention section above. Verify the metadata URL is blocked and a legitimate allowlisted URL succeeds.
── Phase 7 Mastery Checklist ──
Observability
- Explain the 3 pillars and what question each answers
- Write a structured JSON log line with all mandatory fields
- List what must never appear in logs (secrets, PII)
- Explain how trace_id links logs across services
- Distinguish counter, gauge, histogram, summary
- Write PromQL for request rate, error rate %, and p99 latency
- Apply RED method to a service and USE method to a resource
- Explain why high-cardinality labels are dangerous in Prometheus
- Write a Prometheus alerting rule with a for duration
- Explain trace, span, parent_span_id relationship
- Parse and construct a W3C traceparent header
- Choose the right sampling strategy for given traffic/budget
Alerting & SLO
- Define SLI, SLO, SLA, Error Budget
- Write a burn rate alert (faster than threshold-crossing alerts)
- Explain why alerting on symptoms is better than causes
Security & Rate Limiting
- Fix SQL injection with parameterized queries in libpq
- Block SSRF with allowlist + RFC-1918 IP check
- Explain broken access control with a concrete example
- Describe dynamic secrets (Vault) vs static env vars
- Write allowlist input validation for a struct field
- Implement token bucket: capacity, rate, burst
- Compare sliding window counter vs fixed window (boundary problem)
- Return correct 429 + Retry-After + X-RateLimit-* headers