SYSTEM DESIGN MASTERY · TRACK C · MODULE C4 · WEEK 28
METRICS · LOGS · TRACES · SLO/SLI/SLA · INCIDENT RESPONSE · CHAOS
// TRACK C · ADVANCED TOPICS · SITE RELIABILITY ENGINEERING

Observability & SRE

THREE PILLARS · FOUR GOLDEN SIGNALS · SLO/SLI/SLA
ERROR BUDGETS · INCIDENT RESPONSE · CHAOS ENGINEERING
3 PILLARS · 4 GOLDEN SIGNALS · 99.9% EXAMPLE SLO · MODULE C4
Metrics · Logs · Traces · SLI/SLO/SLA · Error Budget · Alerting · Incident Response · Chaos Engineering
The Three Pillars of Observability
Each pillar answers a different question — you need all three
📊
Metrics
"WHAT IS THE SYSTEM DOING?"
Aggregated numeric measurements over time. Low storage cost. Excellent for dashboards and alerting. Cannot tell you WHY something is wrong — only THAT something is wrong.
Tools: Prometheus, Datadog,
CloudWatch, Graphite, VictoriaMetrics
📋
Logs
"WHAT HAPPENED IN DETAIL?"
Timestamped records of discrete events. High storage cost. Rich per-event context. Hard to query at scale. Essential for debugging specific incidents once you know where to look.
Tools: ELK (Elasticsearch+Kibana),
Splunk, Loki+Grafana, CloudWatch Logs
🔍
Traces
"WHERE IN THE SYSTEM IS IT SLOW?"
End-to-end journey of a single request across all microservices. Shows latency waterfall. Answers: "Which service is slow?" and "Which dependency is the bottleneck?"
Tools: Jaeger, Zipkin, AWS X-Ray,
Datadog APM, OpenTelemetry
The key rule: Metrics tell you SOMETHING is wrong. Logs tell you WHAT happened in a specific component. Traces tell you WHERE in the distributed system the problem lives. An on-call engineer uses all three in sequence: metric alert fires → trace shows which service → logs reveal the root cause.
Metrics — The Four Golden Signals
Google SRE Book — the four metrics that matter most for any service
1
LATENCY
How long does a request take? Track p50, p95, p99, p999 — never average (hides outliers). Separate successful latency from error latency.
→ p99 API response time
→ p999 DB query duration
→ Error latency tracked separately
2
TRAFFIC
How much demand is the system receiving? Know your peak — design for 2–3× current peak to handle traffic spikes safely.
→ HTTP requests/sec per endpoint
→ Messages/sec through Kafka
→ Bytes/sec read from disk
3
ERRORS
What fraction of requests are failing? Track error RATE not raw count. Distinguish 4xx (client) from 5xx (server) errors.
→ HTTP 5xx rate per endpoint
→ Failed Kafka consumer events
→ DB transaction rollback rate
4
SATURATION
How "full" is the system? The resource closest to capacity. Saturation predicts problems before they cause errors or timeouts.
→ DB connection pool utilization %
→ CPU / memory utilization
→ Kafka consumer lag
Metric types — Prometheus data model · PROMETHEUS
// COUNTER — monotonically increasing. Use rate() to get per-second rate.
http_requests_total{method="GET", status="200"} 145231
rate(http_requests_total[5m])  → requests/sec over 5-min window

// GAUGE — current value, can go up or down. Query directly.
process_memory_bytes 524288000          → 500MB currently in use
db_connection_pool_active 45            → 45 of 100 connections in use

// HISTOGRAM — distribution in buckets. Enables percentile calculation.
http_request_duration_seconds_bucket{le="0.05"}  8920   ≤50ms: 8920 requests
http_request_duration_seconds_bucket{le="0.1"}   9543   ≤100ms: 9543 requests
http_request_duration_seconds_bucket{le="0.5"}   9981   ≤500ms: 9981 requests
http_request_duration_seconds_bucket{le="Inf"}  10000   total: 10000 requests
// p99: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

// USE method (for resources): Utilization, Saturation, Errors
// RED method (for services): Rate, Errors, Duration
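
A minimal, illustrative sketch of emitting those three metric types with the official prometheus_client Python library. The metric names and buckets mirror the examples above; handle_request() is a hypothetical stand-in for a real handler.

import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "status"],                # labels, as in the example above
)
POOL_ACTIVE = Gauge(
    "db_connection_pool_active", "Active DB connections",
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency",
    buckets=(0.05, 0.1, 0.5),            # 50ms / 100ms / 500ms; +Inf added automatically
)

def handle_request():
    start = time.monotonic()
    # ... real work would happen here ...
    REQUESTS.labels(method="GET", status="200").inc()   # counter: only ever goes up
    LATENCY.observe(time.monotonic() - start)           # histogram: fills the buckets
    POOL_ACTIVE.set(45)                                 # gauge: set to the current value

if __name__ == "__main__":
    start_http_server(8000)              # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
        time.sleep(1)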
Logs — Structured & Searchable
JSON structured logs beat unstructured text at every scale
Structured log — the correct format · JSON LOG
{
  "timestamp":   "2025-03-07T14:23:45.123Z",
  "level":        "ERROR",
  "service":      "payment-service",
  "version":      "v2.4.1",
  "trace_id":     "abc123def456",    ← links to distributed trace
  "span_id":      "7890abcd",
  "user_id":      "u_98765",          ← hashed if GDPR-sensitive
  "order_id":     "ord_12345",
  "message":      "Payment processing failed",
  "error_code":   "CARD_DECLINED",
  "amount_cents": 9999,
  "duration_ms":  234,
  "host":         "payment-pod-7d4b9c"
}

// log levels: DEBUG (dev) → INFO (business events) → WARN (recoverable)
//             → ERROR (needs investigation) → FATAL (immediate action)

// Log pipeline: App → Fluentd/Logstash → Kafka (buffer) → Elasticsearch → Kibana
// Cheaper alternative: App → Promtail → Loki (label-based) → Grafana

// TAIL-BASED SAMPLING (preferred):
// Keep 100% of ERROR/WARN logs. Sample 1% of INFO. Discard DEBUG.
// Preserves signal, reduces storage cost by ~50x on high-traffic services.

// Retention tiers:
// Hot  (Elasticsearch):  7 days   → fast full-text search, recent incidents
// Warm (S3-backed):      30 days  → slower queries, post-incident review
// Cold (S3 Glacier):     90 days  → compliance only, rarely queried
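
Both ideas above, structured JSON output and level-based sampling (keep 100% of WARN and above, sample 1% of INFO), fit in a short standard-library sketch. The service name and the "ctx" extra field are illustrative assumptions, not a required schema.

import json, logging, random, sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level":     record.levelname,
            "service":   "payment-service",            # assumed static field
            "message":   record.getMessage(),
        }
        entry.update(getattr(record, "ctx", {}))       # extra fields: trace_id, etc.
        return json.dumps(entry)

class SamplingFilter(logging.Filter):
    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True                                # keep 100% of WARN/ERROR
        return random.random() < 0.01                  # sample 1% of INFO and below

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
handler.addFilter(SamplingFilter())
log = logging.getLogger("payment-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.error("Payment processing failed",
          extra={"ctx": {"trace_id": "abc123def456", "error_code": "CARD_DECLINED"}})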
Distributed Tracing
Follow a single request end-to-end across every service it touches
Trace waterfall — spotting the bottleneck at a glance · TRACE EXAMPLE
// Trace: place_order request (trace_id: abc123)
// Each bar = one span (one service's processing time)

Span 1: API Gateway          |████████████████████████| 95ms total
  Span 2: Auth Service         |██| 3ms
  Span 3: Order Service          |████████████████| 80ms total
    Span 4: Feature Store call     |████| 8ms
    Span 5: MySQL write            |████████████| 60ms  ← BOTTLENECK
    Span 6: Kafka publish          |██| 4ms
  Span 7: Notification Service   |████| 8ms

// Without tracing: "order service is slow" — check all of its dependencies
// With tracing: "MySQL write is taking 60ms" — go check DB explain plan, indexes

// Context propagation via HTTP headers (B3 format, used by Zipkin/Jaeger):
X-B3-TraceId:    abc123def456789   ← same for all spans in this trace
X-B3-SpanId:     7890abcd           ← unique per span
X-B3-ParentSpanId: 1234efgh         ← parent span's ID
X-B3-Sampled:    1                  ← 1=sample this trace, 0=don't

// OpenTelemetry (OTel): vendor-neutral standard
// Write instrumentation once → export to Jaeger, Datadog, or any backend
// SDK: Java, Python, Go, Node.js — all supported

// Sampling strategy:
// Head-based: decide at trace root (random 1%) — simple but misses rare errors
// Tail-based: decide after trace completes — keep 100% of errors/slow traces
// Preferred: tail-based with head-based as fallback
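
An illustrative sketch with the OpenTelemetry Python SDK: a parent span and a child span matching the waterfall above, plus context injection into outgoing HTTP headers. Note that OTel's default propagator emits the W3C traceparent header rather than B3 (a separate B3 propagator package exists for Zipkin compatibility), and ConsoleSpanExporter stands in for a real Jaeger/OTLP exporter.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.propagate import inject

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())      # swap for a Jaeger/OTLP exporter
)
tracer = trace.get_tracer("order-service")

def place_order():
    with tracer.start_as_current_span("place_order") as span:    # parent span
        span.set_attribute("order.id", "ord_12345")
        with tracer.start_as_current_span("mysql_write"):        # child span
            pass   # ... DB write here; its duration becomes the span length ...
        headers = {}
        inject(headers)   # writes trace context into the dict for downstream calls
        # e.g. requests.post(url, headers=headers); context travels with the request

place_order()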
SLI / SLO / SLA
Define what "reliable" means before you can measure or improve it
SLI
Service Level Indicator
THE MEASURED METRIC
The actual number you track: a ratio of good events to total events.
Example: SLI = (requests returning <200ms) / total_requests × 100%
Good SLIs: availability %, latency %, freshness %, correctness %
Bad SLIs: raw request count, uptime of a single server
SLO
Service Level Objective
YOUR INTERNAL TARGET
Your target for the SLI, set based on what users need, not what's technically easy.
Example: "99.9% of homepage requests complete in <200ms over a 28-day window"
SLOs are internal — no contracts, no penalties. They drive engineering decisions.
SLA
Service Level Agreement
THE CONTRACT
A contractual commitment to customers, with penalties (credits, refunds) if breached.
The SLA must be weaker than the SLO — leave a buffer for incidents and measurement gaps.
Rule of thumb: if SLO = 99.9%, set SLA = 99.5%. Never set SLA = SLO.
Error budget calculation — the math behind the policy · MATH
// Error budget = 100% - SLO = allowed failure rate

SLO = 99.9%  →  error budget = 0.1% of requests may fail
  Over 30 days: 0.001 × 30d × 24h × 60m = 43.2 minutes total downtime allowed

SLO = 99.99% →  error budget = 0.01%
  Over 30 days: 4.32 minutes total downtime allowed

SLO = 99.5%  →  error budget = 0.5%
  Over 30 days: 3.6 hours total downtime allowed

// Choose SLO based on tier, not ambition:
Critical path (payment, auth login):   99.99%  →  4.32 min/month
Core product (feed, search, API):      99.9%   →  43.2 min/month
Non-critical (analytics, recs, admin): 99.5%   →  3.6 hr/month

// Tighter SLO = fewer feature deployments = slower iteration
// Setting 99.99% for everything = no deploys ever = engineering paralysis
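
The budget table above, as a few lines of Python (assuming the same 30-day, 43,200-minute window):

WINDOW_MIN = 30 * 24 * 60                   # 43,200 minutes in 30 days

def budget_minutes(slo_pct: float) -> float:
    return (1 - slo_pct / 100) * WINDOW_MIN

for slo in (99.99, 99.9, 99.5):
    print(f"SLO {slo}% -> {budget_minutes(slo):.2f} min/month")
# SLO 99.99% -> 4.32 min/month
# SLO 99.9% -> 43.20 min/month
# SLO 99.5% -> 216.00 min/month
# (216 minutes = 3.6 hours)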
Error Budget Policy
Turns reliability into an objective, data-driven conversation between product and engineering
> 50%
GREEN — HEALTHY
Deploy freely. Take calculated risks. Ship experimental features. Iterate fast. The budget is there to be spent on innovation.
10–50%
YELLOW — CAUTION
Proceed with caution. Extra review on all deployments. No risky or experimental changes. Fix known reliability issues in next sprint.
< 10%
RED — DANGER
Freeze all non-critical feature deployments. Full reliability focus. All engineering effort on reducing error rate and technical debt.
EXHAUSTED
SLA AT RISK
Escalate to leadership. Full incident mode until budget resets. Customer credits may be triggered. Post-mortem required before any new deploys.
Why error budgets work: Without them, product says "ship faster" and engineering says "we need reliability." Both are right but can't agree. With an error budget, the conversation becomes objective: "We have 12 minutes of budget left this month. Your feature has a 20% chance of causing a 5-minute incident. The math says no." Product can accept math — they can't always accept "engineering intuition."
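
As an illustrative sketch, the policy bands and the conversation above reduce to a few lines; the thresholds come from the cards, and the example numbers are the ones in this paragraph.

def deploy_policy(budget_remaining: float) -> str:
    """budget_remaining = fraction of the monthly error budget left (0.0 to 1.0)."""
    if budget_remaining <= 0:
        return "EXHAUSTED: freeze deploys, escalate, incident mode"
    if budget_remaining < 0.10:
        return "RED: freeze non-critical deploys, full reliability focus"
    if budget_remaining <= 0.50:
        return "YELLOW: extra review, no experimental changes"
    return "GREEN: deploy freely"

# The conversation from above, as arithmetic (99.9% SLO => 43.2 min monthly budget):
remaining_min     = 12                       # minutes of budget left this month
incident_risk     = 0.20                     # feature's chance of causing an incident
incident_cost_min = 5                        # incident duration if it happens
expected_spend    = incident_risk * incident_cost_min    # 1.0 expected minutes
print(deploy_policy(remaining_min / 43.2))   # YELLOW (27.8% of budget remaining)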
Alerting Design
Alert on symptoms, not causes — every alert must be actionable and urgent
LEVEL  | WHEN TO USE                                                    | RESPONSE                                      | EXAMPLE
PAGE   | SLO breach occurring or imminent. Users impacted NOW.          | Wake someone up immediately. Drop everything. | Error rate >1% for 5 min on payments
TICKET | Degraded performance, not yet breaching SLO. Trend concerning. | Address next business day. No 3am wake-up.    | p99 latency +30% (still within SLO)
LOG    | Informational. No action needed. Useful for dashboards only.   | Review weekly. No alert sent.                 | Cache hit rate dropped 5%
Burn rate alerting — the Google SRE recommended approach · PROMETHEUS ALERT RULES
// Burn rate = how fast you're consuming the error budget
// Burn rate 1.0 = consuming at exactly the SLO rate (budget depletes at month end)
// Burn rate 14.4 = consuming 14.4× faster → 1hr window consumes 2% of 30-day budget

// FAST BURN — page immediately (short window catches acute outage)
alert: ErrorBudgetBurnFast
expr: rate(http_errors[1h]) / rate(http_requests[1h]) > (14.4 * 0.001)
       ^^ error rate            ^^ burns budget 14.4× faster than allowed
for: 2m
severity: page
message: "Fast error budget burn — action required immediately"

// SLOW BURN — urgent ticket (longer window catches gradual degradation)
alert: ErrorBudgetBurnSlow
expr: rate(http_errors[6h]) / rate(http_requests[6h]) > (6 * 0.001)
for: 15m
severity: ticket

// BAD ALERT — avoid "CPU > 80%"
// CPU can be high while system is healthy (batch job running)
// CPU can be low while system is broken (stuck waiting for DB)
// Alert on what users experience, not what internal resources are doing

// Alert fatigue rule: if alert fires >1/shift → raise threshold, add duration, or demote
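
The burn-rate arithmetic behind those rules as an illustrative Python check, assuming the same 99.9% SLO as the 0.001 in the expressions above:

SLO    = 0.999
BUDGET = 1 - SLO                    # 0.001 = allowed error rate at burn rate 1.0

def burn_rate(errors: int, requests: int) -> float:
    """Observed error rate divided by the allowed (SLO) error rate."""
    return (errors / requests) / BUDGET

# Fast burn (1h window, page at >14.4x: one hour eats 2% of the 30-day budget):
print(round(burn_rate(errors=180, requests=10_000), 1))   # 0.018 / 0.001 = 18.0 -> page
# Slow burn (6h window, ticket at >6x):
print(round(burn_rate(errors=70, requests=10_000), 1))    # 0.007 / 0.001 = 7.0  -> ticket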
Incident Response
A structured process ensures fast mitigation and organizational learning
SEV1
CRITICAL
Complete outage. All customers impacted. C-suite aware. All hands on deck. Revenue loss per minute.
SEV2
MAJOR
Core feature broken. Large % of customers impacted. Urgent escalation. Primary on-call leads.
SEV3
MINOR
Partial degradation. Workaround available. Small % impacted. Address during business hours.
SEV4
LOW
Minor issue. No immediate impact. Tracked in backlog. Fix in next sprint.
// INCIDENT LIFECYCLE — SEV1 example
T+0m
Alert fires. On-call engineer is paged. SEV1 declared. IC (Incident Commander) assigned.
T+5m
IC assembles team. Scribe opens incident doc. Comms updates status page: "Investigating."
T+12m
TL (tech lead) identifies root cause: bad deploy at T-3m. New model causing OOM on payment pods.
T+15m
MITIGATION: rollback deploy. Error rate begins dropping. IC: "Stop the bleeding first."
T+22m
Error rate returns to baseline. Service restored. Status page updated: "Resolved."
T+48h
Blameless postmortem published. 5 action items filed. No personal blame. 5 Whys completed.
Blameless postmortem structure: (1) Impact — who affected, for how long, user experience. (2) Timeline — minute-by-minute. (3) Root cause — technical cause using 5 Whys. (4) Contributing factors — systemic issues. (5) Action items — specific, assigned, time-bound. Never write "human error" as root cause — humans made reasonable decisions with available information. Fix the system so the same decision doesn't cause the same failure.
Chaos Engineering
Break things deliberately in a controlled way — before production breaks them for you
Chaos Engineering principle: Define a steady state (your SLI baseline). Inject a realistic failure. Observe whether the system maintains steady state. If it does — resilience confirmed. If it doesn't — you found a weakness before your users did. Fix it. Then experiment again. A minimal code sketch of this loop follows the experiment cards below.
Pod Kill
Terminate a random pod mid-traffic. Does the deployment self-heal? Does the load balancer route around it? How fast?
Expected: new pod starts in <30s
Alert: if error rate spikes >1%
Tool: Chaos Monkey, LitmusChaos
AZ Failure
Cut all traffic to one Availability Zone. Does the system reroute to the other AZs? Do health checks remove the failed AZ?
Expected: <30s reroute, <1% errors
Alert: latency increase during switch
Tool: AWS FIS, network rules
Latency Injection
Add 500ms to all DB calls. Do timeouts fire correctly? Do circuit breakers open? Does the UI show degraded state?
Expected: timeout after 1s, circuit opens
Alert: p99 latency alert fires correctly
Tool: TC netem, Toxiproxy
Dependency Down
Take down a non-critical service (notifications). Does the core path (payment) continue working? Is graceful degradation implemented?
Expected: core path unaffected
Notification failures logged but ignored
Tool: iptables, AWS FIS
Resource Exhaustion
Fill disk to 95% on a node. Does the system alert? Does it continue serving? Does log rotation prevent full disk?
Expected: alert at 85%, graceful at 95%
No data loss from full disk
Tool: dd, stress-ng
CPU Spike
Max out CPU on one node. Does the HPA (Horizontal Pod Autoscaler) kick in? Does load balancer route less traffic to overloaded pod?
Expected: scale-out in <2 min
p99 latency increase <50%
Tool: stress-ng, k6 load test
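
A minimal harness for the steady-state loop described at the top of this section. inject_pod_kill() and error_rate() are hypothetical placeholders for your chaos tool (e.g. LitmusChaos) and metrics backend; the 1% threshold matches the pod-kill card above.

import time

STEADY_STATE_MAX_ERROR_RATE = 0.01            # 1% errors: abort/fail threshold

def error_rate() -> float:
    """Placeholder: query your metrics backend (e.g. Prometheus) for the 5xx rate."""
    raise NotImplementedError

def inject_pod_kill() -> None:
    """Placeholder: ask your chaos tool (e.g. LitmusChaos) to kill one random pod."""
    raise NotImplementedError

def run_experiment(duration_s: int = 120) -> bool:
    # 1. Verify steady state before injecting anything
    assert error_rate() < STEADY_STATE_MAX_ERROR_RATE, "system not steady: abort"
    # 2. Inject the failure
    inject_pod_kill()
    # 3. Observe: does the SLI hold while the deployment self-heals?
    deadline = time.time() + duration_s
    while time.time() < deadline:
        if error_rate() > STEADY_STATE_MAX_ERROR_RATE:
            return False                      # weakness found before users found it
        time.sleep(5)
    return True                               # steady state held: resilience confirmed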
1
SLO Design for E-Commerce Checkout
~1.5 hrs
  1. Define 3 SLIs for checkout: availability (% success), latency (% under threshold), and freshness (cart data staleness). Write each as a measurable formula.
  2. Set SLOs for each. Checkout is revenue-critical; users expect <2s page loads. What SLOs are appropriate?
  3. Calculate error budgets (minutes/month) for each SLO you chose.
  4. A deployment causes checkout errors for 8 minutes. How much of the availability budget is consumed? What budget level does that leave? What policy applies?
  5. Product wants to ship a major checkout redesign with 10% risk of incident. Current error budget consumed: 70%. Make and justify your recommendation.
2
Alerting Strategy for URL Shortener (B5)
~1 hr
  1. Define all four golden signals for the URL shortener. What specific metric represents each signal for this service?
  2. Write 3 alert rules with thresholds and duration windows: one for availability (error rate), one for latency (p99), one for saturation (Redis memory).
  3. The redirect endpoint returns 404 for 2% of short URLs. Is this a page, ticket, or log? At what 404 rate would it become a page?
  4. An on-call engineer receives 15 alerts per shift. What is wrong? Name three specific fixes to reduce noise without losing signal.
3
Distributed Tracing for WhatsApp (B7)
~1.5 hrs
  1. Define the spans for a message send flow: client → WebSocket server → Kafka → router → chat server → recipient. What is the parent-child relationship?
  2. Where do you propagate trace context? What HTTP headers? How do you pass trace context through Kafka (hint: Kafka message headers)?
  3. A user reports "my message took 10 seconds to deliver." Without tracing, how would you debug? With tracing, what specifically do you look for in the waterfall?
  4. You're sampling 1% of traces. The slow 10-second delivery is in the other 99%. How do you ensure this specific trace was captured? What sampling strategy?
4
Design a Self-Hosted Observability Platform
~3 hrs

500 engineers, 5000 microservices, 10 TB/day of logs, 1M metric series, millions of traces/day.

  1. Metrics pipeline: instrumentation → collection → storage → alerting. Technology choices, scale numbers, retention strategy.
  2. Log pipeline: ingestion → buffering → indexing → tiered retention. How do you handle 10 TB/day economically? What does the Elasticsearch cluster look like?
  3. Trace pipeline: instrumentation (OTel) → sampling strategy → storage → query backend. What's your storage choice for traces and why?
  4. Unified correlation: when an alert fires, how does the on-call engineer navigate from metric → relevant logs → relevant trace in under 60 seconds?
  5. SLO management at scale: storing, tracking, and alerting on error budgets for 5000 services. Can you use Prometheus for this? What's the schema?
  6. Top 3 cost optimization strategies for logs and traces without sacrificing observability for incidents.
MODULE C4 · OBSERVABILITY & SRE
Three pillars: metrics (what), logs (what happened), traces (where)
Four golden signals: latency, traffic, errors, saturation
Track p99 not average — average hides tail latency outliers
Metric types: counter (rate()), gauge (current), histogram (percentiles)
RED method (services): Rate, Errors, Duration
USE method (resources): Utilization, Saturation, Errors
Structured JSON logs: trace_id field links logs to traces
Tail-based log sampling: keep 100% errors, sample 1% normal
Log retention tiers: hot (7d Elasticsearch), warm (30d), cold (90d Glacier)
Distributed tracing: trace, span, parent-child structure, trace_id propagation
B3 headers: X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId
OpenTelemetry: vendor-neutral, SDK + Collector architecture
SLI: measured ratio of good events to total events
SLO: internal target — set based on user need, not technical ease
SLA: contractual commitment, always weaker than SLO with buffer
Error budget math: 99.9% SLO → 43.2 min/month budget
Error budget policy: green/yellow/red/exhausted — what each means for deploys
Alert on symptoms not causes — actionable + urgent
Burn rate alerting: fast (1h, 14.4×) = page; slow (6h, 6×) = ticket
Alert fatigue: >1 page/shift = alert needs tuning
Incident roles: IC (coordinates), TL (debugs), comms, scribe
Blameless postmortem: 5 sections, no personal blame, 5 Whys
Chaos engineering: steady state → inject failure → observe → validate
✏️ Tasks 1–4: SLO design, alerting strategy, distributed tracing, observability platform
// NEXT MODULE
C5 — Security Architecture
OAuth2 / OIDC · JWT deep dive · Zero-Trust architecture
Secrets management · mTLS · API security · OWASP Top 10
Rate limiting for abuse prevention · DDoS mitigation