The Three Pillars of Observability
Each pillar answers a different question — you need all three
Metrics
"WHAT IS THE SYSTEM DOING?"
Aggregated numeric measurements over time. Low storage cost. Excellent for dashboards and alerting. Cannot tell you WHY something is wrong — only THAT something is wrong.
Tools: Prometheus, Datadog,
CloudWatch, Graphite, VictoriaMetrics
Logs
"WHAT HAPPENED IN DETAIL?"
Timestamped records of discrete events. High storage cost. Rich per-event context. Hard to query at scale. Essential for debugging specific incidents once you know where to look.
Tools: ELK (Elasticsearch+Kibana),
Splunk, Loki+Grafana, CloudWatch Logs
Traces
"WHERE IN THE SYSTEM IS IT SLOW?"
End-to-end journey of a single request across all microservices. Shows latency waterfall. Answers: "Which service is slow?" and "Which dependency is the bottleneck?"
Tools: Jaeger, Zipkin, AWS X-Ray,
Datadog APM, OpenTelemetry
The key rule: Metrics tell you SOMETHING is wrong. Logs tell you WHAT happened in a specific component. Traces tell you WHERE in the distributed system the problem lives. An on-call engineer uses all three in sequence: metric alert fires → trace shows which service → logs reveal the root cause.
Metrics — The Four Golden Signals
Google SRE Book — the four metrics that matter most for any service
1
LATENCY
How long does a request take? Track p50, p95, p99, p999 — never the average (it hides outliers). Separate successful-request latency from error latency.
→ p99 API response time
→ p999 DB query duration
→ Error latency tracked separately
2
TRAFFIC
How much demand is the system receiving? Know your peak — design for 2–3× current peak to handle traffic spikes safely.
→ HTTP requests/sec per endpoint
→ Messages/sec through Kafka
→ Bytes/sec read from disk
3
ERRORS
What fraction of requests is failing? Track the error RATE, not the raw count. Distinguish 4xx (client) from 5xx (server) errors.
→ HTTP 5xx rate per endpoint
→ Failed Kafka consumer events
→ DB transaction rollback rate
4
SATURATION
How "full" is the system? The resource closest to capacity. Saturation predicts problems before they cause errors or timeouts.
→ DB connection pool utilization %
→ CPU / memory utilization
→ Kafka consumer lag
Metric types — Prometheus data model (PROMETHEUS)
// COUNTER — monotonically increasing. Use rate() to get a per-second rate.
http_requests_total{method="GET", status="200"}  145231
rate(http_requests_total[5m])                    → requests/sec over a 5-min window

// GAUGE — current value, can go up or down. Query directly.
process_memory_bytes        524288000  → 500MB currently in use
db_connection_pool_active   45         → 45 of 100 connections in use

// HISTOGRAM — distribution in buckets. Enables percentile calculation.
http_request_duration_seconds_bucket{le="0.05"}  8920   ≤50ms:  8920 requests
http_request_duration_seconds_bucket{le="0.1"}   9543   ≤100ms: 9543 requests
http_request_duration_seconds_bucket{le="0.5"}   9981   ≤500ms: 9981 requests
http_request_duration_seconds_bucket{le="+Inf"}  10000  total:  10000 requests
// p99: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

// USE method (for resources): Utilization, Saturation, Errors
// RED method (for services):  Rate, Errors, Duration
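To see how these series get produced, here is a minimal instrumentation sketch using the Python prometheus_client library; the metric names mirror the examples above, while handle_request() and its timing logic are hypothetical stand-ins for real request handling.

# Minimal sketch: the three metric types via the Python prometheus_client library.
# handle_request() is a hypothetical stand-in for a real request handler.
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["method", "status"])
POOL_ACTIVE = Gauge("db_connection_pool_active", "Active DB connections")
LATENCY = Histogram("http_request_duration_seconds", "Request duration",
                    buckets=[0.05, 0.1, 0.5])  # +Inf bucket added automatically

def handle_request():
    start = time.time()
    POOL_ACTIVE.inc()                                      # gauge: up and down
    try:
        REQUESTS.labels(method="GET", status="200").inc()  # counter: only up
    finally:
        POOL_ACTIVE.dec()
        LATENCY.observe(time.time() - start)               # histogram: fills buckets

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
        time.sleep(0.1)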
Logs — Structured & Searchable
JSON structured logs beat unstructured text at every scale
Structured log — the correct format (JSON LOG)
{
"timestamp": "2025-03-07T14:23:45.123Z",
"level": "ERROR",
"service": "payment-service",
"version": "v2.4.1",
"trace_id": "abc123def456", ← links to distributed trace
"span_id": "7890abcd",
"user_id": "u_98765", ← hashed if GDPR-sensitive
"order_id": "ord_12345",
"message": "Payment processing failed",
"error_code": "CARD_DECLINED",
"amount_cents": 9999,
"duration_ms": 234,
"host": "payment-pod-7d4b9c"
}
// log levels: DEBUG (dev) → INFO (business events) → WARN (recoverable)
// → ERROR (needs investigation) → FATAL (immediate action)
// Log pipeline: App → Fluentd/Logstash → Kafka (buffer) → Elasticsearch → Kibana
// Cheaper alternative: App → Promtail → Loki (label-based) → Grafana
// TAIL-BASED SAMPLING (preferred):
// Keep 100% of ERROR/WARN logs. Sample 1% of INFO. Discard DEBUG.
// Preserves signal, reduces storage cost by ~50x on high-traffic services.
// Retention tiers:
// Hot (Elasticsearch): 7 days → fast full-text search, recent incidents
// Warm (S3-backed): 30 days → slower queries, post-incident review
// Cold (S3 Glacier): 90 days → compliance only, rarely queried
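As a sketch of producing this format in application code — using only the Python standard library, with illustrative field values and a hypothetical ctx convention for per-event fields:

# Minimal sketch: structured JSON logging with the Python standard library.
# Field names follow the example above; the values here are illustrative.
import json, logging, sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "payment-service",
            "message": record.getMessage(),
        }
        entry.update(getattr(record, "ctx", {}))  # per-event context fields
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payment-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Pass event context via `extra` so every field stays individually queryable.
log.error("Payment processing failed", extra={"ctx": {
    "trace_id": "abc123def456",  # links this log line to the distributed trace
    "error_code": "CARD_DECLINED",
    "duration_ms": 234,
}})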
Distributed Tracing
Follow a single request end-to-end across every service it touches
Trace waterfall — spotting the bottleneck at a glance (TRACE EXAMPLE)
// Trace: place_order request (trace_id: abc123)
// Each bar = one span (one service's processing time)
Span 1: API Gateway            |████████████████████████| 95ms total
Span 2:   Auth Service         |██|                       3ms
Span 3:   Order Service        |████████████████|         80ms total
Span 4:     Feature Store call |████|                     8ms
Span 5:     MySQL write        |████████████|             60ms ← BOTTLENECK
Span 6:     Kafka publish      |██|                       4ms
Span 7:   Notification Service |████|                     8ms

// Without tracing: "order service is slow" — check all of its dependencies
// With tracing: "MySQL write is taking 60ms" — go check the DB explain plan and indexes

// Context propagation via HTTP headers (B3 format, used by Zipkin/Jaeger):
X-B3-TraceId: abc123def456789   ← same for all spans in this trace
X-B3-SpanId: 7890abcd           ← unique per span
X-B3-ParentSpanId: 1234efgh     ← parent span's ID
X-B3-Sampled: 1                 ← 1 = sample this trace, 0 = don't

// OpenTelemetry (OTel): vendor-neutral standard
// Write instrumentation once → export to Jaeger, Datadog, or any backend
// SDKs for Java, Python, Go, Node.js — all supported

// Sampling strategy:
// Head-based: decide at the trace root (random 1%) — simple, but misses rare errors
// Tail-based: decide after the trace completes — keep 100% of errors/slow traces
// Preferred: tail-based with head-based as fallback
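A minimal sketch of generating such spans with the OpenTelemetry Python SDK (assuming opentelemetry-sdk is installed); the span names mirror the waterfall above, and the console exporter stands in for a real Jaeger or OTLP backend:

# Minimal sketch: parent/child spans with the OpenTelemetry Python SDK.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)   # console exporter stands in for Jaeger/OTLP
tracer = trace.get_tracer("order-service")

def place_order():
    with tracer.start_as_current_span("order_service"):      # Span 3 (parent)
        with tracer.start_as_current_span("mysql_write"):    # Span 5 (child)
            pass  # DB call here — its duration becomes the span length
        with tracer.start_as_current_span("kafka_publish"):  # Span 6 (child)
            pass  # same trace_id, new span_id, parent = order_service

place_order()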
SLI / SLO / SLA
Define what "reliable" means before you can measure or improve it
SLI
Service Level Indicator
THE MEASURED METRIC
The actual number you track. A ratio of good events to total events.
Example: SLI = (requests returning <200ms) / total_requests × 100%
Good SLIs: availability %, latency %, freshness %, correctness %
Bad SLIs: raw request count, uptime of a single server
SLO
Service Level Objective
YOUR INTERNAL TARGET
Your target for the SLI. Set based on what users need, not what's technically easy.
Example: "99.9% of homepage requests complete in <200ms over a 28-day window"
SLOs are internal — no contracts, no penalties. Drives engineering decisions.
SLA
Service Level Agreement
THE CONTRACT
Contractual commitment to customers. With penalties (credits, refunds) if breached.
SLA MUST be weaker than SLO — leave a buffer for incidents + measurement gaps.
Rule: if SLO = 99.9%, set SLA = 99.5%. Never set SLA = SLO.
Error budget calculation — the math behind the policy (MATH)
// Error budget = 100% − SLO = allowed failure rate
SLO = 99.9%  → error budget = 0.1% of requests may fail
  Over 30 days: 0.001 × 30d × 24h × 60m = 43.2 minutes of total downtime allowed
SLO = 99.99% → error budget = 0.01%
  Over 30 days: 4.32 minutes of total downtime allowed
SLO = 99.5%  → error budget = 0.5%
  Over 30 days: 3.6 hours of total downtime allowed

// Choose SLO based on tier, not ambition:
Critical path (payment, auth login):    99.99% → 4.32 min/month
Core product (feed, search, API):       99.9%  → 43.2 min/month
Non-critical (analytics, recs, admin):  99.5%  → 3.6 hr/month

// Tighter SLO = fewer feature deployments = slower iteration
// Setting 99.99% for everything = no deploys ever = engineering paralysis
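The arithmetic above is easy to script; a small helper like this sketch (the function name is ours, not from any library) keeps the unit conversions honest:

# Sketch: error budget arithmetic for a 30-day window (matches the math above).
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    budget_fraction = 1 - slo_percent / 100          # e.g. 99.9% → 0.001
    return budget_fraction * window_days * 24 * 60   # allowed downtime, minutes

for slo in (99.5, 99.9, 99.99):
    print(f"SLO {slo}% → {error_budget_minutes(slo):.2f} min / 30 days")
# SLO 99.5%  → 216.00 min / 30 days  (3.6 hours)
# SLO 99.9%  → 43.20 min / 30 days
# SLO 99.99% → 4.32 min / 30 days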
Error Budget Policy
Turns reliability into an objective, data-driven conversation between product and engineering
●
> 50%
GREEN — HEALTHY
Deploy freely. Take calculated risks. Ship experimental features. Iterate fast. The budget is there to be spent on innovation.
●
10–50%
YELLOW — CAUTION
Proceed with caution. Extra review on all deployments. No risky or experimental changes. Fix known reliability issues in next sprint.
●
< 10%
RED — DANGER
Freeze all non-critical feature deployments. Full reliability focus. All engineering effort on reducing error rate and technical debt.
●
EXHAUSTED
SLA AT RISK
Escalate to leadership. Full incident mode until budget resets. Customer credits may be triggered. Post-mortem required before any new deploys.
Why error budgets work: Without them, product says "ship faster" and engineering says "we need reliability." Both are right but can't agree. With an error budget, the conversation becomes objective: "We have 12 minutes of budget left this month. Your feature has a 20% chance of causing a 5-minute incident. The math says no." Product can accept math — they can't always accept "engineering intuition."
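The thresholds in this policy reduce to a few lines of code; the following sketch (function names hypothetical) encodes the table above and the go/no-go call from that conversation:

# Sketch: the policy table above as a go/no-go check (function names are ours).
def policy(budget_remaining: float) -> str:
    if budget_remaining > 0.50: return "GREEN"
    if budget_remaining > 0.10: return "YELLOW"
    if budget_remaining > 0.00: return "RED"
    return "EXHAUSTED"

def risky_deploy_allowed(budget_remaining: float) -> bool:
    # YELLOW already forbids risky or experimental changes, per the table above
    return policy(budget_remaining) == "GREEN"

# "12 minutes of a 43.2-minute budget left" → 28% remaining → YELLOW → no.
print(policy(12 / 43.2), risky_deploy_allowed(12 / 43.2))  # YELLOW False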
Alerting Design
Alert on symptoms, not causes — every alert must be actionable and urgent
| LEVEL | WHEN TO USE | RESPONSE | EXAMPLE |
|---|---|---|---|
| PAGE | SLO breach occurring or imminent. Users impacted NOW. | Wake someone up immediately. Drop everything. | Error rate >1% for 5 min on payments |
| TICKET | Degraded performance, not yet breaching SLO. Trend concerning. | Address next business day. No 3am wake-up. | p99 latency +30% (still within SLO) |
| LOG | Informational. No action needed. Useful for dashboards only. | Review weekly. No alert sent. | Cache hit rate dropped 5% |
Burn rate alerting — the Google SRE recommended approach (PROMETHEUS ALERT RULES)
// Burn rate = how fast you're consuming the error budget
// Burn rate 1.0  = consuming at exactly the SLO rate (budget depletes at month end)
// Burn rate 14.4 = consuming 14.4× faster → a 1h window consumes 2% of the 30-day budget

// FAST BURN — page immediately (short window catches an acute outage)
alert: ErrorBudgetBurnFast
expr: rate(http_errors[1h]) / rate(http_requests[1h]) > (14.4 * 0.001)
      // error rate burning the budget 14.4× faster than allowed
for: 2m
severity: page
message: "Fast error budget burn — action required immediately"

// SLOW BURN — urgent ticket (longer window catches gradual degradation)
alert: ErrorBudgetBurnSlow
expr: rate(http_errors[6h]) / rate(http_requests[6h]) > (6 * 0.001)
for: 15m
severity: ticket

// BAD ALERT — avoid "CPU > 80%"
// CPU can be high while the system is healthy (batch job running)
// CPU can be low while the system is broken (stuck waiting for the DB)
// Alert on what users experience, not on what internal resources are doing

// Alert fatigue rule: if an alert fires >1×/shift → raise the threshold, add duration, or demote it
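The same two-window logic can be checked outside Prometheus; here is a sketch (thresholds copied from the rules above, counts stubbed) of how a burn-rate evaluation reduces to two ratios against the error budget:

# Sketch: the multiwindow burn-rate check above, expressed as plain arithmetic.
# In production the error/request counts would come from your metrics backend.
SLO_ERROR_BUDGET = 0.001  # 99.9% SLO → 0.1% of requests may fail

def burn_rate(errors: int, requests: int) -> float:
    return (errors / requests) / SLO_ERROR_BUDGET if requests else 0.0

def evaluate(err_1h, req_1h, err_6h, req_6h) -> str:
    if burn_rate(err_1h, req_1h) > 14.4:  # fast burn — acute outage
        return "page"
    if burn_rate(err_6h, req_6h) > 6:     # slow burn — gradual degradation
        return "ticket"
    return "ok"

# 2% error rate over the last hour = burn rate 20 → page
print(evaluate(err_1h=2_000, req_1h=100_000, err_6h=3_000, req_6h=600_000))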
Incident Response
A structured process ensures fast mitigation and organizational learning
SEV1
CRITICAL
Complete outage. All customers impacted. C-suite aware. All hands on deck. Revenue loss per minute.
SEV2
MAJOR
Core feature broken. Large % of customers impacted. Urgent escalation. Primary on-call leads.
SEV3
MINOR
Partial degradation. Workaround available. Small % impacted. Address during business hours.
SEV4
LOW
Minor issue. No immediate impact. Tracked in backlog. Fix in next sprint.
// INCIDENT LIFECYCLE — SEV1 example
T+0m
Alert fires. On-call engineer is paged. SEV1 declared. Incident Commander (IC) assigned.
T+5m
IC assembles team. Scribe opens incident doc. Comms updates status page: "Investigating."
T+12m
TL (Tech Lead) identifies root cause: bad deploy at T−3m. New model causing OOM on payment pods.
T+15m
MITIGATION: rollback deploy. Error rate begins dropping. IC: "Stop the bleeding first."
T+22m
Error rate returns to baseline. Service restored. Status page updated: "Resolved."
T+48h
Blameless postmortem published. 5 action items filed. No personal blame. 5 Whys completed.
Blameless postmortem structure: (1) Impact — who affected, for how long, user experience. (2) Timeline — minute-by-minute. (3) Root cause — technical cause using 5 Whys. (4) Contributing factors — systemic issues. (5) Action items — specific, assigned, time-bound. Never write "human error" as root cause — humans made reasonable decisions with available information. Fix the system so the same decision doesn't cause the same failure.
Chaos Engineering
Break things deliberately in a controlled way — before production breaks them for you
Chaos Engineering principle: Define a steady state (your SLI baseline). Inject a realistic failure. Observe whether the system maintains steady state. If it does — resilience confirmed. If it doesn't — you found a weakness before your users did. Fix it. Then experiment again.
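To make the loop concrete, here is a skeletal harness in Python — every helper is a hypothetical placeholder for your metrics backend and failure injector:

# Sketch: the chaos-engineering loop as code. Every helper here is a
# hypothetical placeholder for your metrics backend and failure injector.
import time

def current_sli() -> float:
    return 0.999  # placeholder: good_events / total_events from your metrics backend

def run_experiment(inject, revert, sli_baseline: float,
                   tolerance: float = 0.001, observe_s: int = 60) -> str:
    assert current_sli() >= sli_baseline, "not in steady state — abort experiment"
    inject()                      # e.g. kill a random pod, add 500ms DB latency
    try:
        time.sleep(observe_s)     # observation window: let the failure propagate
        drop = sli_baseline - current_sli()
        if drop <= tolerance:
            return "resilience confirmed"
        return f"weakness found: SLI dropped {drop:.3%} — fix it, then rerun"
    finally:
        revert()                  # always restore the system, pass or fail

# Usage: wire in a real injector; no-ops here so the sketch runs standalone.
print(run_experiment(inject=lambda: None, revert=lambda: None,
                     sli_baseline=0.999, observe_s=1))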
Pod Kill
Terminate a random pod mid-traffic. Does the deployment self-heal? Does the load balancer route around it? How fast?
Expected: new pod starts in <30s
Alert: if error rate spikes >1%
Tool: Chaos Monkey, LitmusChaos
AZ Failure
Cut all traffic to one Availability Zone. Does the system reroute to the other AZs? Do health checks remove the failed AZ?
Expected: <30s reroute, <1% errors
Alert: latency increase during switch
Tool: AWS FIS, network rules
Latency Injection
Add 500ms to all DB calls. Do timeouts fire correctly? Do circuit breakers open? Does the UI show degraded state?
Expected: timeout after 1s, circuit opens
Alert: p99 latency alert fires correctly
Tool: TC netem, Toxiproxy
Dependency Down
Take down a non-critical service (notifications). Does the core path (payment) continue working? Is graceful degradation implemented?
Expected: core path unaffected
Notification failures logged but ignored
Tool: iptables, AWS FIS
Resource Exhaustion
Fill disk to 95% on a node. Does the system alert? Does it continue serving? Does log rotation prevent full disk?
Expected: alert at 85%, graceful at 95%
No data loss from full disk
Tool: dd, stress-ng
CPU Spike
Max out CPU on one node. Does the HPA (Horizontal Pod Autoscaler) kick in? Does the load balancer route less traffic to the overloaded pod?
Expected: scale-out in <2 min
p99 latency increase <50%
Tool: stress-ng, k6 load test
1
SLO Design for E-Commerce Checkout
- Define 3 SLIs for checkout: availability (% success), latency (% under threshold), and freshness (cart data staleness). Write each as a measurable formula.
- Set SLOs for each. Checkout is revenue-critical; users expect <2s page loads. What SLOs are appropriate?
- Calculate error budgets (minutes/month) for each SLO you chose.
- A deployment causes checkout errors for 8 minutes. How much of the availability budget is consumed? What budget level does that leave? What policy applies?
- Product wants to ship a major checkout redesign with 10% risk of incident. Current error budget consumed: 70%. Make and justify your recommendation.
2
Alerting Strategy for URL Shortener (B5)
- Define all four golden signals for the URL shortener. What specific metric represents each signal for this service?
- Write 3 alert rules with thresholds and duration windows: one for availability (error rate), one for latency (p99), one for saturation (Redis memory).
- The redirect endpoint returns 404 for 2% of short URLs. Is this a page, ticket, or log? At what 404 rate would it become a page?
- An on-call engineer receives 15 alerts per shift. What is wrong? Name three specific fixes to reduce noise without losing signal.
3
Distributed Tracing for WhatsApp (B7)
- Define the spans for a message send flow: client → WebSocket server → Kafka → router → chat server → recipient. What is the parent-child relationship?
- Where do you propagate trace context? What HTTP headers? How do you pass trace context through Kafka (hint: Kafka message headers)?
- A user reports "my message took 10 seconds to deliver." Without tracing, how would you debug? With tracing, what specifically do you look for in the waterfall?
- You're sampling 1% of traces. The slow 10-second delivery is in the other 99%. How do you ensure this specific trace was captured? What sampling strategy?
★
Design a Self-Hosted Observability Platform
500 engineers, 5000 microservices, 10 TB/day logs, 1M metrics series, millions of traces/day.
- Metrics pipeline: instrumentation → collection → storage → alerting. Technology choices, scale numbers, retention strategy.
- Log pipeline: ingestion → buffering → indexing → tiered retention. How do you handle 10 TB/day economically? What does the Elasticsearch cluster look like?
- Trace pipeline: instrumentation (OTel) → sampling strategy → storage → query backend. What's your storage choice for traces and why?
- Unified correlation: when an alert fires, how does the on-call engineer navigate from metric → relevant logs → relevant trace in under 60 seconds?
- SLO management at scale: storing, tracking, and alerting on error budgets for 5000 services. Can you use Prometheus for this? What's the schema?
- Top 3 cost optimization strategies for logs and traces without sacrificing observability for incidents.
MODULE C4 · OBSERVABILITY & SRE
Three pillars: metrics (what), logs (what happened), traces (where)
Four golden signals: latency, traffic, errors, saturation
Track p99 not average — average hides tail latency outliers
Metric types: counter (rate()), gauge (current), histogram (percentiles)
RED method (services): Rate, Errors, Duration
USE method (resources): Utilization, Saturation, Errors
Structured JSON logs: trace_id field links logs to traces
Tail-based log sampling: keep 100% errors, sample 1% normal
Log retention tiers: hot (7d Elasticsearch), warm (30d), cold (90d Glacier)
Distributed tracing: trace, span, parent-child structure, trace_id propagation
B3 headers: X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId
OpenTelemetry: vendor-neutral, SDK + Collector architecture
SLI: measured ratio of good events to total events
SLO: internal target — set based on user need, not technical ease
SLA: contractual commitment, always weaker than SLO with buffer
Error budget math: 99.9% SLO → 43.2 min/month budget
Error budget policy: green/yellow/red/exhausted — what each means for deploys
Alert on symptoms not causes — actionable + urgent
Burn rate alerting: fast (1h, 14.4×) = page; slow (6h, 6×) = ticket
Alert fatigue: >1 page/shift = alert needs tuning
Incident roles: IC (coordinates), TL (debugs), comms, scribe
Blameless postmortem: 5 sections, no personal blame, 5 Whys
Chaos engineering: steady state → inject failure → observe → validate
✏️ Tasks 1–4: SLO design, alerting, tracing, observability platform
// NEXT MODULE
C5 — Security Architecture
OAuth2 / OIDC · JWT deep dive · Zero-Trust architecture
Secrets management · mTLS · API security · OWASP Top 10
Rate limiting for abuse prevention · DDoS mitigation