Chapter 13: Observability and SLO-lite Operations

Word target: 3,200
Primary deliverable: Baseline telemetry stack and alert runbook
Key diagrams: Metrics/logs/alerts architecture

Learning Goals

  • Capture essential metrics, logs, and uptime signals.
  • Define practical service objectives for homelab scale.
  • Create alerts that are actionable, not noisy.

MVP Lab Worksheet

  • Objective: Deploy minimum observability stack.
  • Starting state: Services running without centralized telemetry.
  • Steps:
    1. Add node and service metrics.
    2. Configure centralized log collection.
    3. Create three core alerts.
  • Evidence: Dashboards and alert test screenshots.
  • Exit criteria: Each alert fires on a simulated fault and resolves once the fault clears.
  • Rollback: Remove alert rules causing noise.
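
One way to keep the three core alerts actionable rather than noisy is to require a condition to hold for several consecutive samples before the alert fires, and to resolve it as soon as the condition clears. The sketch below illustrates that idea only. The `Alert` class, its field names, and the sample values are hypothetical and are not tied to any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical threshold alert with a 'hold' period to suppress flapping."""
    name: str
    threshold: float
    for_samples: int      # consecutive breaches required before firing (assumed knob)
    breaches: int = 0     # current run of consecutive breaching samples
    firing: bool = False

    def observe(self, value: float) -> bool:
        """Record one sample and return whether the alert is firing."""
        if value > self.threshold:
            self.breaches += 1
        else:
            # Any healthy sample resets the run and resolves the alert.
            self.breaches = 0
        self.firing = self.breaches >= self.for_samples
        return self.firing


# Example: disk usage (percent) sampled once per scrape interval.
disk_alert = Alert(name="disk_usage_high", threshold=90.0, for_samples=3)
samples = [95, 96, 50, 95, 96, 97, 98, 40]
states = [disk_alert.observe(v) for v in samples]
```

In this run the brief two-sample spike never fires, the sustained breach fires on its third sample, and the final healthy sample resolves the alert. Replaying a recorded fault through a check like this is one way to produce the alert test evidence before capturing screenshots.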

Advanced Lab Worksheet

  • Objective: Add SLO error budget view.
  • Starting state: Basic telemetry active.
  • Steps:
    1. Define an SLI and an SLO for each of two critical services.
    2. Add burn-rate alerting.
    3. Tie alerts to incident runbook actions.
  • Evidence: SLO dashboards and incident drill logs.
  • Exit criteria: Error-budget visibility in weekly ops review.
  • Rollback: Revert to baseline alert set.
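
The arithmetic behind the error budget view and the burn-rate alert can be sketched as follows. The error budget is the fraction of requests the SLO allows to fail (one minus the target), and the burn rate is the observed error ratio divided by that budget. The example assumes a request-based SLI and a two-window check, in which both a short and a long window must exceed the threshold. `SloWindow`, the function names, and the threshold value are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one evaluation window (hypothetical shape)."""
    total: int
    failed: int


def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.01 for a 99% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(window: SloWindow, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error ratio.

    A value of 1.0 means the budget is being spent exactly at the pace
    the SLO permits; higher values exhaust it early.
    """
    if window.total == 0:
        return 0.0  # no traffic, so nothing was burned in this window
    observed = window.failed / window.total
    return observed / error_budget(slo_target)


def should_alert(short: SloWindow, long: SloWindow,
                 slo_target: float, threshold: float) -> bool:
    """Alert only when both windows burn faster than the threshold.

    The short window confirms the problem is still happening; the long
    window confirms it is not a momentary blip.
    """
    return (burn_rate(short, slo_target) >= threshold
            and burn_rate(long, slo_target) >= threshold)
```

For example, 50 failures in 1,000 requests against a 99% target is a burn rate of about 5, meaning the budget would run out in roughly one fifth of the SLO period if that pace continued. The threshold itself is a tuning decision for the weekly ops review rather than a fixed value.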

Author Gap Check

Include a “what not to monitor yet” list to keep the MVP scope realistic.