Chapter 13: Observability and SLO-lite Operations
Word target: 3,200
Primary deliverable: Baseline telemetry stack and alert runbook
Key diagrams: Metrics/logs/alerts architecture
Learning Goals
- Capture essential metrics, logs, and uptime signals.
- Define practical service objectives for homelab scale.
- Create alerts that are actionable, not noisy.
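To make "practical service objectives for homelab scale" concrete, it can help to translate an availability target into the downtime it allows. The sketch below is illustrative only: the function name `downtime_budget`, the 30-day window, and the three candidate targets are assumptions, not values prescribed by this chapter.

```python
from datetime import timedelta


def downtime_budget(slo_percent: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Return the downtime an availability target permits over a window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    allowed_fraction = 1 - slo_percent / 100
    return window * allowed_fraction


# Candidate targets are assumptions chosen for illustration.
for target in (99.0, 99.5, 99.9):
    minutes = downtime_budget(target).total_seconds() / 60
    print(f"{target}% over 30 days allows about {minutes:.0f} minutes of downtime")
```

A 99% target over 30 days leaves roughly 432 minutes (about 7 hours) of slack, which is often a more honest fit for a single-operator homelab than a 99.9% target with about 43 minutes.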
MVP Lab Worksheet
- Objective: Deploy minimum observability stack.
- Starting state: Services running without centralized telemetry.
- Steps:
  - Add node and service metrics.
  - Configure centralized log collection.
  - Create three core alerts.
- Evidence: Dashboards and alert test screenshots.
- Exit criteria: Alerts fire and resolve as expected.
- Rollback: Remove alert rules causing noise.
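The exit criterion "alerts fire and resolve as expected" can be rehearsed before touching any real alerting tool. The sketch below models one threshold alert with a hold period, similar in spirit to the `for` clause in Prometheus alerting rules, though this chapter outline does not name a specific tool. The class `ThresholdAlert`, the rule name `disk_usage_high`, the 90% threshold, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdAlert:
    """Fires once a value stays above `threshold` for `hold_seconds`; resolves when it drops back."""

    name: str
    threshold: float
    hold_seconds: float
    breach_started: Optional[float] = None
    firing: bool = False

    def observe(self, value: float, now: float) -> str:
        """Feed one sample and return the resulting state: ok, pending, or firing."""
        if value <= self.threshold:
            # Back under the threshold: clear any pending breach and resolve.
            self.breach_started = None
            self.firing = False
            return "ok"
        if self.breach_started is None:
            self.breach_started = now
        if now - self.breach_started >= self.hold_seconds:
            self.firing = True
            return "firing"
        return "pending"


# Hypothetical rule: disk usage above 90% for five minutes.
disk_alert = ThresholdAlert(name="disk_usage_high", threshold=90.0, hold_seconds=300)

# (timestamp in seconds, disk usage percent); made-up samples for the drill.
samples = [(0, 85.0), (60, 93.0), (180, 95.0), (360, 96.0), (420, 88.0)]
states = [disk_alert.observe(value, now) for now, value in samples]
print(states)
```

The hold period is what keeps a brief spike from paging anyone: the alert only moves from pending to firing after the breach has lasted the full five minutes, and it resolves on the first sample back under the threshold.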
Advanced Lab Worksheet
- Objective: Add SLO error budget view.
- Starting state: Basic telemetry active.
- Steps:
  - Define SLI/SLO for two critical services.
  - Add burn-rate alerting.
  - Tie alerts to incident runbook actions.
- Evidence: SLO dashboards and incident drill logs.
- Exit criteria: Error-budget visibility in weekly ops review.
- Rollback: Revert to baseline alert set.
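The burn-rate step can be sketched as a small calculation. Burn rate is the observed error rate divided by the error budget (1 minus the SLO target), so a value of 1.0 means the budget would be used up exactly at the end of the SLO window. The example pairs a long and a short lookback window, a pattern commonly described in SRE literature. The function names, the 0.99 target, the window sizes, the event counts, and the threshold of 6 are assumptions for illustration.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Return the observed error rate divided by the error budget (1 - slo_target)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # No traffic in the window, so nothing was burned.
    error_rate = bad_events / total_events
    return error_rate / (1 - slo_target)


def should_alert(long_rate: float, short_rate: float, threshold: float) -> bool:
    """Alert only if both windows burn fast: sustained (long) and still ongoing (short)."""
    return long_rate >= threshold and short_rate >= threshold


# Assumed SLO: 99% of requests succeed.
SLO_TARGET = 0.99

# Made-up counts: 80 failures out of 1,000 requests in the last hour,
# and 12 failures out of 100 requests in the last five minutes.
long_rate = burn_rate(bad_events=80, total_events=1000, slo_target=SLO_TARGET)
short_rate = burn_rate(bad_events=12, total_events=100, slo_target=SLO_TARGET)

print(f"1h burn rate: {long_rate:.1f}, 5m burn rate: {short_rate:.1f}")
print("alert" if should_alert(long_rate, short_rate, threshold=6.0) else "no alert")
```

Requiring both windows to exceed the threshold means the alert clears soon after the problem stops, which keeps it tied to a runbook action that is still relevant.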
Author Gap Check
Include a “what not to monitor yet” note to keep the MVP scope realistic.