Chapter 13: Observability and SLO-lite Operations

Updated

July 30, 2026

Word target: 3,200
Primary deliverable: Baseline telemetry stack and alert runbook
Key diagrams: Metrics/logs/alerts architecture

Objective: Deploy a minimum observability stack.
Starting state: Services running without centralized telemetry.
Steps:
1. Collect node-level and service-level metrics.
2. Configure centralized log collection.
3. Create three core alerts.
Evidence: Dashboards and alert test screenshots.
Exit criteria: Each alert fires under an induced fault and resolves once the fault is cleared.
Rollback: Remove alert rules causing noise.
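
The fire-and-resolve exit criterion above can be rehearsed offline by replaying a series of samples through a rule before trusting it in production. What follows is a minimal Python sketch, not tied to any particular monitoring product. The ThresholdAlert class, its for_samples parameter, the HighCpu name, and the 0.90 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ThresholdAlert:
    """Hypothetical rule: fire after `for_samples` consecutive samples above `threshold`."""
    name: str
    threshold: float
    for_samples: int = 3
    breaches: int = 0      # consecutive samples above the threshold so far
    firing: bool = False   # current alert state

    def observe(self, value: float) -> Optional[str]:
        """Feed one sample; return 'fired', 'resolved', or None if the state did not change."""
        if value > self.threshold:
            self.breaches += 1
            if not self.firing and self.breaches >= self.for_samples:
                self.firing = True
                return "fired"
        else:
            self.breaches = 0
            if self.firing:
                self.firing = False
                return "resolved"
        return None


def replay(alert: ThresholdAlert, samples: List[float]) -> List[Tuple[int, str]]:
    """Run samples through the rule and collect (sample_index, event) pairs."""
    events = []
    for index, value in enumerate(samples):
        event = alert.observe(value)
        if event is not None:
            events.append((index, event))
    return events


# Illustrative data only: a sustained breach followed by recovery.
cpu = ThresholdAlert(name="HighCpu", threshold=0.90, for_samples=3)
events = replay(cpu, [0.50, 0.95, 0.96, 0.97, 0.98, 0.40, 0.30])
print(events)  # [(3, 'fired'), (5, 'resolved')]
```

A spike shorter than for_samples produces no events in this sketch, which is the property that helps keep a small alert set from becoming noisy.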

Objective: Add an SLO error-budget view.
Starting state: Basic telemetry active.
Steps:
1. Define an SLI and an SLO for each of two critical services.
2. Add burn-rate alerting.
3. Link each alert to a specific incident runbook action.
Evidence: SLO dashboards + incident drill logs.
Exit criteria: Error budget is visible in the weekly ops review.
Rollback: Revert to baseline alert set.
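
The arithmetic behind the error-budget and burn-rate steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed implementation: the function names, the 99.9% target, and the request counts are hypothetical. The default threshold of 14.4 is a commonly cited paging value that corresponds to spending 2% of a 30-day budget in one hour (0.02 × 720 hours); treat it as an assumption to tune, not a requirement of this chapter.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. roughly 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error ratio.

    A value of 1.0 means the budget is being spent exactly at the pace
    that would exhaust it at the end of the SLO period.
    """
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (errors / total) / error_budget(slo_target)


def should_page(long_window: tuple, short_window: tuple,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Multi-window check: page only if both windows burn faster than `threshold`.

    Each window is an (errors, total) pair. The short window keeps the alert
    from continuing to fire after the problem has already stopped.
    """
    long_burn = burn_rate(long_window[0], long_window[1], slo_target)
    short_burn = burn_rate(short_window[0], short_window[1], slo_target)
    return long_burn > threshold and short_burn > threshold


# Illustrative numbers only: (errors, total) for a 1-hour and a 5-minute window.
slo = 0.999
ongoing = should_page((200, 10_000), (30, 1_000), slo)    # both windows are burning fast
recovered = should_page((200, 10_000), (0, 1_000), slo)   # short window is clean again
print(ongoing, recovered)
```

Requiring both windows to exceed the threshold is one way to tie a burn-rate alert to a runbook action without paging on an incident that has already ended; a single-window rule would be simpler but noisier.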

Author Gap Check

Include a “what not to monitor yet” section to keep the MVP scope realistic.