SYSTEM DESIGN MASTERY · TRACK C · MODULE C4 · WEEK 28
METRICS · LOGS · TRACES · SLO/SLI/SLA · INCIDENT RESPONSE · CHAOS
// TRACK C · ADVANCED TOPICS · SITE RELIABILITY ENGINEERING

Observability & SRE

THREE PILLARS · FOUR GOLDEN SIGNALS · SLO/SLI/SLA
ERROR BUDGETS · INCIDENT RESPONSE · CHAOS ENGINEERING
3 PILLARS · 4 GOLDEN SIGNALS · 99.9% EXAMPLE SLO · MODULE C4
Metrics · Logs · Traces · SLI/SLO/SLA · Error Budget · Alerting · Incident Response · Chaos Engineering
The Three Pillars of Observability
Each pillar answers a different question — you need all three
📊
Metrics
"WHAT IS THE SYSTEM DOING?"
Aggregated numeric measurements over time. Low storage cost. Excellent for dashboards and alerting. Cannot tell you WHY something is wrong — only THAT something is wrong.
Tools: Prometheus, Datadog,
CloudWatch, Graphite, VictoriaMetrics
📋
Logs
"WHAT HAPPENED IN DETAIL?"
Timestamped records of discrete events. High storage cost. Rich per-event context. Hard to query at scale. Essential for debugging specific incidents once you know where to look.
Tools: ELK (Elasticsearch+Kibana),
Splunk, Loki+Grafana, CloudWatch Logs
🔍
Traces
"WHERE IN THE SYSTEM IS IT SLOW?"
End-to-end journey of a single request across all microservices. Shows latency waterfall. Answers: "Which service is slow?" and "Which dependency is the bottleneck?"
Tools: Jaeger, Zipkin, AWS X-Ray,
Datadog APM, OpenTelemetry
The key rule: Metrics tell you SOMETHING is wrong. Logs tell you WHAT happened in a specific component. Traces tell you WHERE in the distributed system the problem lives. An on-call engineer uses all three in sequence: metric alert fires → trace shows which service → logs reveal the root cause.
Metrics — The Four Golden Signals
Google SRE Book — the four metrics that matter most for any service
1
LATENCY
How long does a request take? Track p50, p95, p99, p999 — never average (hides outliers). Separate successful latency from error latency.
→ p99 API response time
→ p999 DB query duration
→ Error latency tracked separately
2
TRAFFIC
How much demand is the system receiving? Know your peak — design for 2–3× current peak to handle traffic spikes safely.
→ HTTP requests/sec per endpoint
→ Messages/sec through Kafka
→ Bytes/sec read from disk
3
ERRORS
What fraction of requests are failing? Track error RATE not raw count. Distinguish 4xx (client) from 5xx (server) errors.
→ HTTP 5xx rate per endpoint
→ Failed Kafka consumer events
→ DB transaction rollback rate
4
SATURATION
How "full" is the system? The resource closest to capacity. Saturation predicts problems before they cause errors or timeouts.
→ DB connection pool utilization %
→ CPU / memory utilization
→ Kafka consumer lag
Metric types — Prometheus data model · PROMETHEUS
// COUNTER — monotonically increasing. Use rate() to get per-second rate.
http_requests_total{method="GET", status="200"} 145231
rate(http_requests_total[5m])  → requests/sec over 5-min window

// GAUGE — current value, can go up or down. Query directly.
process_memory_bytes 524288000          → 500MB currently in use
db_connection_pool_active 45            → 45 of 100 connections in use

// HISTOGRAM — distribution in buckets. Enables percentile calculation.
http_request_duration_seconds_bucket{le="0.05"}  8920   ≤50ms: 8920 requests
http_request_duration_seconds_bucket{le="0.1"}   9543   ≤100ms: 9543 requests
http_request_duration_seconds_bucket{le="0.5"}   9981   ≤500ms: 9981 requests
http_request_duration_seconds_bucket{le="Inf"}  10000   total: 10000 requests
// p99: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

// USE method (for resources): Utilization, Saturation, Errors
// RED method (for services): Rate, Errors, Duration
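
A minimal, illustrative sketch of emitting those three metric types with the official prometheus_client Python library. The metric names and buckets mirror the examples above; handle_request() is a hypothetical stand-in for a real handler.

import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "status"],                # labels, as in the example above
)
POOL_ACTIVE = Gauge(
    "db_connection_pool_active", "Active DB connections",
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency",
    buckets=(0.05, 0.1, 0.5),            # 50ms / 100ms / 500ms; +Inf added automatically
)

def handle_request():
    start = time.monotonic()
    # ... real work would happen here ...
    REQUESTS.labels(method="GET", status="200").inc()   # counter: only ever goes up
    LATENCY.observe(time.monotonic() - start)           # histogram: fills the buckets
    POOL_ACTIVE.set(45)                                 # gauge: set to the current value

if __name__ == "__main__":
    start_http_server(8000)              # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
        time.sleep(1)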
Logs — Structured & Searchable
JSON structured logs beat unstructured text at every scale
Structured log — the correct format · JSON LOG
{
  "timestamp":   "2025-03-07T14:23:45.123Z",
  "level":        "ERROR",
  "service":      "payment-service",
  "version":      "v2.4.1",
  "trace_id":     "abc123def456",    ← links to distributed trace
  "span_id":      "7890abcd",
  "user_id":      "u_98765",          ← hashed if GDPR-sensitive
  "order_id":     "ord_12345",
  "message":      "Payment processing failed",
  "error_code":   "CARD_DECLINED",
  "amount_cents": 9999,
  "duration_ms":  234,
  "host":         "payment-pod-7d4b9c"
}

// log levels: DEBUG (dev) → INFO (business events) → WARN (recoverable)
//             → ERROR (needs investigation) → FATAL (immediate action)

// Log pipeline: App → Fluentd/Logstash → Kafka (buffer) → Elasticsearch → Kibana
// Cheaper alternative: App → Promtail → Loki (label-based) → Grafana

// TAIL-BASED SAMPLING (preferred):
// Keep 100% of ERROR/WARN logs. Sample 1% of INFO. Discard DEBUG.
// Preserves signal, reduces storage cost by ~50x on high-traffic services.

// Retention tiers:
// Hot  (Elasticsearch):  7 days   → fast full-text search, recent incidents
// Warm (S3-backed):      30 days  → slower queries, post-incident review
// Cold (S3 Glacier):     90 days  → compliance only, rarely queried
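
Both ideas above, structured JSON output and level-based sampling (keep 100% of WARN and above, sample 1% of INFO), fit in a short standard-library sketch. The service name and the "ctx" extra field are illustrative assumptions, not a required schema.

import json, logging, random, sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level":     record.levelname,
            "service":   "payment-service",            # assumed static field
            "message":   record.getMessage(),
        }
        entry.update(getattr(record, "ctx", {}))       # extra fields: trace_id, etc.
        return json.dumps(entry)

class SamplingFilter(logging.Filter):
    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True                                # keep 100% of WARN/ERROR
        return random.random() < 0.01                  # sample 1% of INFO and below

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
handler.addFilter(SamplingFilter())
log = logging.getLogger("payment-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.error("Payment processing failed",
          extra={"ctx": {"trace_id": "abc123def456", "error_code": "CARD_DECLINED"}})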
Distributed Tracing
Follow a single request end-to-end across every service it touches
Trace waterfall — spotting the bottleneck at a glance · TRACE EXAMPLE
// Trace: place_order request (trace_id: abc123)
// Each bar = one span (one service's processing time)

Span 1: API Gateway          |████████████████████████| 95ms total
  Span 2: Auth Service         |██| 3ms
  Span 3: Order Service          |████████████████| 80ms total
    Span 4: Feature Store call     |████| 8ms
    Span 5: MySQL write            |████████████| 60ms  ← BOTTLENECK
    Span 6: Kafka publish          |██| 4ms
  Span 7: Notification Service   |████| 8ms

// Without tracing: "order service is slow" — check all of its dependencies
// With tracing: "MySQL write is taking 60ms" — go check DB explain plan, indexes

// Context propagation via HTTP headers (B3 format, used by Zipkin/Jaeger):
X-B3-TraceId:    abc123def456789   ← same for all spans in this trace
X-B3-SpanId:     7890abcd           ← unique per span
X-B3-ParentSpanId: 1234efgh         ← parent span's ID
X-B3-Sampled:    1                  ← 1=sample this trace, 0=don't

// OpenTelemetry (OTel): vendor-neutral standard
// Write instrumentation once → export to Jaeger, Datadog, or any backend
// SDK: Java, Python, Go, Node.js — all supported

// Sampling strategy:
// Head-based: decide at trace root (random 1%) — simple but misses rare errors
// Tail-based: decide after trace completes — keep 100% of errors/slow traces
// Preferred: tail-based with head-based as fallback
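
An illustrative sketch with the OpenTelemetry Python SDK: a parent span and a child span matching the waterfall above, plus context injection into outgoing HTTP headers. Note that OTel's default propagator emits the W3C traceparent header rather than B3 (a separate B3 propagator package exists for Zipkin compatibility), and ConsoleSpanExporter stands in for a real Jaeger/OTLP exporter.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.propagate import inject

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())      # swap for a Jaeger/OTLP exporter
)
tracer = trace.get_tracer("order-service")

def place_order():
    with tracer.start_as_current_span("place_order") as span:    # parent span
        span.set_attribute("order.id", "ord_12345")
        with tracer.start_as_current_span("mysql_write"):        # child span
            pass   # ... DB write here; its duration becomes the span length ...
        headers = {}
        inject(headers)   # writes trace context into the dict for downstream calls
        # e.g. requests.post(url, headers=headers); context travels with the request

place_order()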
SLI / SLO / SLA
Define what "reliable" means before you can measure or improve it
SLI
Service Level Indicator
THE MEASURED METRIC
The actual number you track: a ratio of good events to total events.
Example: SLI = (requests returning <200ms) / total_requests × 100%
Good SLIs: availability %, latency %, freshness %, correctness %
Bad SLIs: raw request count, uptime of a single server
SLO
Service Level Objective
YOUR INTERNAL TARGET
Your target for the SLI, set based on what users need, not what's technically easy.
Example: "99.9% of homepage requests complete in <200ms over a 28-day window"
SLOs are internal — no contracts, no penalties. They drive engineering decisions.
SLA
Service Level Agreement
THE CONTRACT
A contractual commitment to customers, with penalties (credits, refunds) if breached.
The SLA must be weaker than the SLO — leave a buffer for incidents and measurement gaps.
Rule of thumb: if SLO = 99.9%, set SLA = 99.5%. Never set SLA = SLO.
Error budget calculation — the math behind the policy · MATH
// Error budget = 100% - SLO = allowed failure rate

SLO = 99.9%  →  error budget = 0.1% of requests may fail
  Over 30 days: 0.001 × 30d × 24h × 60m = 43.2 minutes total downtime allowed

SLO = 99.99% →  error budget = 0.01%
  Over 30 days: 4.32 minutes total downtime allowed

SLO = 99.5%  →  error budget = 0.5%
  Over 30 days: 3.6 hours total downtime allowed

// Choose SLO based on tier, not ambition:
Critical path (payment, auth login):   99.99%  →  4.32 min/month
Core product (feed, search, API):      99.9%   →  43.2 min/month
Non-critical (analytics, recs, admin): 99.5%   →  3.6 hr/month

// Tighter SLO = fewer feature deployments = slower iteration
// Setting 99.99% for everything = no deploys ever = engineering paralysis
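
The budget table above, as a few lines of Python (assuming the same 30-day, 43,200-minute window):

WINDOW_MIN = 30 * 24 * 60                   # 43,200 minutes in 30 days

def budget_minutes(slo_pct: float) -> float:
    return (1 - slo_pct / 100) * WINDOW_MIN

for slo in (99.99, 99.9, 99.5):
    print(f"SLO {slo}% -> {budget_minutes(slo):.2f} min/month")
# SLO 99.99% -> 4.32 min/month
# SLO 99.9% -> 43.20 min/month
# SLO 99.5% -> 216.00 min/month
# (216 minutes = 3.6 hours)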
Error Budget Policy
Turns reliability into an objective, data-driven conversation between product and engineering
> 50%
GREEN — HEALTHY
Deploy freely. Take calculated risks. Ship experimental features. Iterate fast. The budget is there to be spent on innovation.
10–50%
YELLOW — CAUTION
Proceed with caution. Extra review on all deployments. No risky or experimental changes. Fix known reliability issues in next sprint.
< 10%
RED — DANGER
Freeze all non-critical feature deployments. Full reliability focus. All engineering effort on reducing error rate and technical debt.
EXHAUSTED
SLA AT RISK
Escalate to leadership. Full incident mode until budget resets. Customer credits may be triggered. Post-mortem required before any new deploys.
Why error budgets work: Without them, product says "ship faster" and engineering says "we need reliability." Both are right but can't agree. With an error budget, the conversation becomes objective: "We have 12 minutes of budget left this month. Your feature has a 20% chance of causing a 5-minute incident. The math says no." Product can accept math — they can't always accept "engineering intuition."
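
As an illustrative sketch, the policy bands and the conversation above reduce to a few lines; the thresholds come from the cards, and the example numbers are the ones in this paragraph.

def deploy_policy(budget_remaining: float) -> str:
    """budget_remaining = fraction of the monthly error budget left (0.0 to 1.0)."""
    if budget_remaining <= 0:
        return "EXHAUSTED: freeze deploys, escalate, incident mode"
    if budget_remaining < 0.10:
        return "RED: freeze non-critical deploys, full reliability focus"
    if budget_remaining <= 0.50:
        return "YELLOW: extra review, no experimental changes"
    return "GREEN: deploy freely"

# The conversation from above, as arithmetic (99.9% SLO => 43.2 min monthly budget):
remaining_min     = 12                       # minutes of budget left this month
incident_risk     = 0.20                     # feature's chance of causing an incident
incident_cost_min = 5                        # incident duration if it happens
expected_spend    = incident_risk * incident_cost_min    # 1.0 expected minutes
print(deploy_policy(remaining_min / 43.2))   # YELLOW (27.8% of budget remaining)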
Alerting Design
Alert on symptoms, not causes — every alert must be actionable and urgent
LEVEL  | WHEN TO USE                                                    | RESPONSE                                      | EXAMPLE
PAGE   | SLO breach occurring or imminent. Users impacted NOW.          | Wake someone up immediately. Drop everything. | Error rate >1% for 5 min on payments
TICKET | Degraded performance, not yet breaching SLO. Trend concerning. | Address next business day. No 3am wake-up.    | p99 latency +30% (still within SLO)
LOG    | Informational. No action needed. Useful for dashboards only.   | Review weekly. No alert sent.                 | Cache hit rate dropped 5%
Burn rate alerting — the Google SRE recommended approach · PROMETHEUS ALERT RULES
// Burn rate = how fast you're consuming the error budget
// Burn rate 1.0 = consuming at exactly the SLO rate (budget depletes at month end)
// Burn rate 14.4 = consuming 14.4× faster → 1hr window consumes 2% of 30-day budget

// FAST BURN — page immediately (short window catches acute outage)
alert: ErrorBudgetBurnFast
expr: rate(http_errors[1h]) / rate(http_requests[1h]) > (14.4 * 0.001)
       ^^ error rate            ^^ burns budget 14.4× faster than allowed
for: 2m
severity: page
message: "Fast error budget burn — action required immediately"

// SLOW BURN — urgent ticket (longer window catches gradual degradation)
alert: ErrorBudgetBurnSlow
expr: rate(http_errors[6h]) / rate(http_requests[6h]) > (6 * 0.001)
for: 15m
severity: ticket

// BAD ALERT — avoid "CPU > 80%"
// CPU can be high while system is healthy (batch job running)
// CPU can be low while system is broken (stuck waiting for DB)
// Alert on what users experience, not what internal resources are doing

// Alert fatigue rule: if alert fires >1/shift → raise threshold, add duration, or demote
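
The burn-rate arithmetic behind those rules as an illustrative Python check, assuming the same 99.9% SLO as the 0.001 in the expressions above:

SLO    = 0.999
BUDGET = 1 - SLO                    # 0.001 = allowed error rate at burn rate 1.0

def burn_rate(errors: int, requests: int) -> float:
    """Observed error rate divided by the allowed (SLO) error rate."""
    return (errors / requests) / BUDGET

# Fast burn (1h window, page at >14.4x: one hour eats 2% of the 30-day budget):
print(round(burn_rate(errors=180, requests=10_000), 1))   # 0.018 / 0.001 = 18.0 -> page
# Slow burn (6h window, ticket at >6x):
print(round(burn_rate(errors=70, requests=10_000), 1))    # 0.007 / 0.001 = 7.0  -> ticket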
Incident Response
A structured process ensures fast mitigation and organizational learning
SEV1
CRITICAL
Complete outage. All customers impacted. C-suite aware. All hands on deck. Revenue loss per minute.
SEV2
MAJOR
Core feature broken. Large % of customers impacted. Urgent escalation. Primary on-call leads.
SEV3
MINOR
Partial degradation. Workaround available. Small % impacted. Address during business hours.
SEV4
LOW
Minor issue. No immediate impact. Tracked in backlog. Fix in next sprint.
// INCIDENT LIFECYCLE — SEV1 example
T+0m
Alert fires. On-call engineer is paged. SEV1 declared. IC (Incident Commander) assigned.
T+5m
IC assembles team. Scribe opens incident doc. Comms updates status page: "Investigating."
T+12m
TL (tech lead) identifies root cause: bad deploy at T-3m. New model causing OOM on payment pods.
T+15m
MITIGATION: rollback deploy. Error rate begins dropping. IC: "Stop the bleeding first."
T+22m
Error rate returns to baseline. Service restored. Status page updated: "Resolved."
T+48h
Blameless postmortem published. 5 action items filed. No personal blame. 5 Whys completed.
Blameless postmortem structure: (1) Impact — who affected, for how long, user experience. (2) Timeline — minute-by-minute. (3) Root cause — technical cause using 5 Whys. (4) Contributing factors — systemic issues. (5) Action items — specific, assigned, time-bound. Never write "human error" as root cause — humans made reasonable decisions with available information. Fix the system so the same decision doesn't cause the same failure.
Chaos Engineering
Break things deliberately in a controlled way — before production breaks them for you
Chaos Engineering principle: Define a steady state (your SLI baseline). Inject a realistic failure. Observe whether the system maintains steady state. If it does — resilience confirmed. If it doesn't — you found a weakness before your users did. Fix it. Then experiment again. A minimal code sketch of this loop follows the experiment cards below.
Pod Kill
Terminate a random pod mid-traffic. Does the deployment self-heal? Does the load balancer route around it? How fast?
Expected: new pod starts in <30s
Alert: if error rate spikes >1%
Tool: Chaos Monkey, LitmusChaos
AZ Failure
Cut all traffic to one Availability Zone. Does the system reroute to the other AZs? Do health checks remove the failed AZ?
Expected: <30s reroute, <1% errors
Alert: latency increase during switch
Tool: AWS FIS, network rules
Latency Injection
Add 500ms to all DB calls. Do timeouts fire correctly? Do circuit breakers open? Does the UI show degraded state?
Expected: timeout after 1s, circuit opens
Alert: p99 latency alert fires correctly
Tool: TC netem, Toxiproxy
Dependency Down
Take down a non-critical service (notifications). Does the core path (payment) continue working? Is graceful degradation implemented?
Expected: core path unaffected
Notification failures logged but ignored
Tool: iptables, AWS FIS
Resource Exhaustion
Fill disk to 95% on a node. Does the system alert? Does it continue serving? Does log rotation prevent full disk?
Expected: alert at 85%, graceful at 95%
No data loss from full disk
Tool: dd, stress-ng
CPU Spike
Max out CPU on one node. Does the HPA (Horizontal Pod Autoscaler) kick in? Does load balancer route less traffic to overloaded pod?
Expected: scale-out in <2 min
p99 latency increase <50%
Tool: stress-ng, k6 load test
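
A minimal harness for the steady-state loop described at the top of this section. inject_pod_kill() and error_rate() are hypothetical placeholders for your chaos tool (e.g. LitmusChaos) and metrics backend; the 1% threshold matches the pod-kill card above.

import time

STEADY_STATE_MAX_ERROR_RATE = 0.01            # 1% errors: abort/fail threshold

def error_rate() -> float:
    """Placeholder: query your metrics backend (e.g. Prometheus) for the 5xx rate."""
    raise NotImplementedError

def inject_pod_kill() -> None:
    """Placeholder: ask your chaos tool (e.g. LitmusChaos) to kill one random pod."""
    raise NotImplementedError

def run_experiment(duration_s: int = 120) -> bool:
    # 1. Verify steady state before injecting anything
    assert error_rate() < STEADY_STATE_MAX_ERROR_RATE, "system not steady: abort"
    # 2. Inject the failure
    inject_pod_kill()
    # 3. Observe: does the SLI hold while the deployment self-heals?
    deadline = time.time() + duration_s
    while time.time() < deadline:
        if error_rate() > STEADY_STATE_MAX_ERROR_RATE:
            return False                      # weakness found before users found it
        time.sleep(5)
    return True                               # steady state held: resilience confirmed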
1
SLO Design for E-Commerce Checkout
~1.5 hrs
  1. Define 3 SLIs for checkout: availability (% success), latency (% under threshold), and freshness (cart data staleness). Write each as a measurable formula.
  2. Set SLOs for each. Checkout is revenue-critical; users expect <2s page loads. What SLOs are appropriate?
  3. Calculate error budgets (minutes/month) for each SLO you chose.
  4. A deployment causes checkout errors for 8 minutes. How much of the availability budget is consumed? What budget level does that leave? What policy applies?
  5. Product wants to ship a major checkout redesign with 10% risk of incident. Current error budget consumed: 70%. Make and justify your recommendation.
2
Alerting Strategy for URL Shortener (B5)
~1 hr
  1. Define all four golden signals for the URL shortener. What specific metric represents each signal for this service?
  2. Write 3 alert rules with thresholds and duration windows: one for availability (error rate), one for latency (p99), one for saturation (Redis memory).
  3. The redirect endpoint returns 404 for 2% of short URLs. Is this a page, ticket, or log? At what 404 rate would it become a page?
  4. An on-call engineer receives 15 alerts per shift. What is wrong? Name three specific fixes to reduce noise without losing signal.
3
Distributed Tracing for WhatsApp (B7)
~1.5 hrs
  1. Define the spans for a message send flow: client → WebSocket server → Kafka → router → chat server → recipient. What is the parent-child relationship?
  2. Where do you propagate trace context? What HTTP headers? How do you pass trace context through Kafka (hint: Kafka message headers)?
  3. A user reports "my message took 10 seconds to deliver." Without tracing, how would you debug? With tracing, what specifically do you look for in the waterfall?
  4. You're sampling 1% of traces. The slow 10-second delivery is in the other 99%. How do you ensure this specific trace was captured? What sampling strategy?
4
Design a Self-Hosted Observability Platform
~3 hrs

500 engineers, 5000 microservices, 10 TB/day of logs, 1M metric series, millions of traces/day.

  1. Metrics pipeline: instrumentation → collection → storage → alerting. Technology choices, scale numbers, retention strategy.
  2. Log pipeline: ingestion → buffering → indexing → tiered retention. How do you handle 10 TB/day economically? What does the Elasticsearch cluster look like?
  3. Trace pipeline: instrumentation (OTel) → sampling strategy → storage → query backend. What's your storage choice for traces and why?
  4. Unified correlation: when an alert fires, how does the on-call engineer navigate from metric → relevant logs → relevant trace in under 60 seconds?
  5. SLO management at scale: storing, tracking, and alerting on error budgets for 5000 services. Can you use Prometheus for this? What's the schema?
  6. Top 3 cost optimization strategies for logs and traces without sacrificing observability for incidents.
MODULE C4 · OBSERVABILITY & SRE
Three pillars: metrics (what), logs (what happened), traces (where)
Four golden signals: latency, traffic, errors, saturation
Track p99 not average — average hides tail latency outliers
Metric types: counter (rate()), gauge (current), histogram (percentiles)
RED method (services): Rate, Errors, Duration
USE method (resources): Utilization, Saturation, Errors
Structured JSON logs: trace_id field links logs to traces
Tail-based log sampling: keep 100% errors, sample 1% normal
Log retention tiers: hot (7d Elasticsearch), warm (30d), cold (90d Glacier)
Distributed tracing: trace, span, parent-child structure, trace_id propagation
B3 headers: X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId
OpenTelemetry: vendor-neutral, SDK + Collector architecture
SLI: measured ratio of good events to total events
SLO: internal target — set based on user need, not technical ease
SLA: contractual commitment, always weaker than SLO with buffer
Error budget math: 99.9% SLO → 43.2 min/month budget
Error budget policy: green/yellow/red/exhausted — what each means for deploys
Alert on symptoms not causes — actionable + urgent
Burn rate alerting: fast (1h, 14.4×) = page; slow (6h, 6×) = ticket
Alert fatigue: >1 page/shift = alert needs tuning
Incident roles: IC (coordinates), TL (debugs), comms, scribe
Blameless postmortem: 5 sections, no personal blame, 5 Whys
Chaos engineering: steady state → inject failure → observe → validate
✏️ Tasks 1–4: SLO design, alerting strategy, distributed tracing, observability platform
// NEXT MODULE
C5 — Security Architecture
OAuth2 / OIDC · JWT deep dive · Zero-Trust architecture
Secrets management · mTLS · API security · OWASP Top 10
Rate limiting for abuse prevention · DDoS mitigation