Two speeds of alert
Use fast-burn thresholds (for example 14x over 5 minutes) to catch sudden failures, and slow-burn thresholds (2–3x over 1–6 hours) to surface creeping risk before customers pile into support.
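As a rough sketch in Python (the helper names and window objects are illustrative, not tied to any particular monitoring stack), the pairing might look like this, where burn rate is the observed error ratio divided by the error budget the SLO allows:

```python
from dataclasses import dataclass

@dataclass
class BurnWindow:
    """One burn-rate rule: how far over budget, measured over which window."""
    window_minutes: int
    threshold: float  # burn-rate multiple that should page

# Fast burn catches sudden failures; slow burn catches creeping risk.
FAST_BURN = BurnWindow(window_minutes=5, threshold=14.0)
SLOW_BURN = BurnWindow(window_minutes=60, threshold=3.0)

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Burn rate = observed error ratio / error budget allowed by the SLO.
    A 99.9% SLO allows a 0.001 error ratio, so 1.4% errors burns at 14x."""
    allowed = 1.0 - slo_target
    return error_ratio / allowed if allowed > 0 else float("inf")

def should_page(error_ratio: float, slo_target: float, rule: BurnWindow) -> bool:
    """True when the burn rate over the rule's window exceeds its threshold."""
    return burn_rate(error_ratio, slo_target) >= rule.threshold
```

With a 99.9% target, `should_page(0.014, 0.999, FAST_BURN)` returns True: 1.4% errors is exactly a 14x burn.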
Route both to a small primary rotation plus a product sponsor so decisions balance customer impact and business tradeoffs. Make the alert state whether the error budget for the period is already exhausted.
Tie burn alerts to deployment freezes automatically when thresholds trip, and unlock only after mitigation or an explicit override.
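A minimal sketch of that gate, assuming a shared in-process flag rather than any real deployment tooling; the class and method names are made up for illustration:

```python
from datetime import datetime, timezone

class DeployGate:
    """Minimal freeze gate: a tripped burn alert blocks deploys until the burn
    is mitigated or someone records an explicit, attributed override."""

    def __init__(self) -> None:
        self.frozen_since: datetime | None = None
        self.override_by: str | None = None

    def on_burn_alert(self) -> None:
        # Trip the freeze the moment a burn threshold fires.
        self.frozen_since = self.frozen_since or datetime.now(timezone.utc)
        self.override_by = None

    def on_mitigated(self) -> None:
        # Clear the freeze once the burn is confirmed mitigated.
        self.frozen_since = None
        self.override_by = None

    def record_override(self, approver: str) -> None:
        # An explicit override unblocks deploys while the freeze stays recorded.
        self.override_by = approver

    def deploys_allowed(self) -> bool:
        return self.frozen_since is None or self.override_by is not None
```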
Attach runbooks
Include clear rollback steps, feature flag toggles, and links to the dashboards that confirm the mitigation worked. If the alert fires outside business hours, the instructions should be runnable by any on-call engineer.
Log the cause and mitigation of each burn in a simple runbook entry. That history improves forecasting and budget planning, and shows whether the thresholds still fit the system's current shape.
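One possible shape for that entry, sketched as a Python dataclass; every field name here is an assumption about what your history needs, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class BurnLogEntry:
    """One line of burn history: what burned, why, and what fixed it."""
    incident_id: str
    sli: str                      # e.g. "checkout availability"
    peak_burn_rate: float         # worst multiple observed
    budget_consumed_pct: float    # share of the period's budget this burn ate
    cause: str                    # e.g. "bad config push to payment gateway"
    mitigation: str               # e.g. "rolled back the previous release"
    detected_at: datetime
    resolved_at: datetime
    threshold_feedback: str = ""  # note if the burn pair fired too late or too early
```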
Keep the alert payload concise: which SLI is burning, how much budget is left, what changed recently, and when the next escalation happens.
Alert ingredients
- SLI and current vs. target burn rate, with a link to the graph
- Current budget left and projected exhaustion time
- Next escalation time and the channel/people it will page
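A hedged sketch of a payload builder covering those ingredients; the field names and the 30-day period default are assumptions, and projected exhaustion simply extrapolates the current burn rate:

```python
from datetime import datetime, timedelta, timezone

def build_alert_payload(
    sli: str,
    current_burn: float,
    threshold: float,
    budget_remaining_pct: float,
    recent_change: str,
    next_escalation: datetime,
    escalation_channel: str,
    graph_url: str,
    period_days: int = 30,
) -> dict:
    """Assemble the concise payload described above. Projected exhaustion
    assumes the current burn rate holds for the rest of the period."""
    if current_burn > 0:
        # At burn rate B, the remaining budget fraction lasts (fraction * period) / B.
        hours_left = (period_days * 24) * (budget_remaining_pct / 100) / current_burn
        exhaustion = datetime.now(timezone.utc) + timedelta(hours=hours_left)
    else:
        exhaustion = None
    return {
        "sli": sli,
        "burn_rate": f"{current_burn:.1f}x (threshold {threshold:.0f}x)",
        "budget_remaining_pct": budget_remaining_pct,
        "projected_exhaustion": exhaustion.isoformat() if exhaustion else "n/a",
        "recent_change": recent_change,
        "next_escalation": f"{next_escalation.isoformat()} -> {escalation_channel}",
        "graph": graph_url,
    }
```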
Show context
Display burn rate on incident status pages so customers understand whether the issue is contained or still growing.
Pause deploys when burn exceeds a threshold until SLOs recover and health checks have stayed green for a defined window.
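A sketch of that resume condition, assuming health check results arrive as timestamped pass/fail pairs; the 30-minute green window is a placeholder, not a recommendation:

```python
from datetime import datetime, timedelta, timezone

def safe_to_resume(
    burn_rate: float,
    burn_threshold: float,
    health_checks: list[tuple[datetime, bool]],  # (timestamp, passed)
    green_window: timedelta = timedelta(minutes=30),
) -> bool:
    """Resume deploys only when the burn is back under threshold AND every
    health check inside the green window has passed."""
    now = datetime.now(timezone.utc)
    if burn_rate >= burn_threshold:
        return False
    recent = [ok for ts, ok in health_checks if now - ts <= green_window]
    # An empty window proves nothing, so require at least one data point.
    return bool(recent) and all(recent)
```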
Review every burn alert weekly: did it fire for the right reason, was the response fast, and should thresholds be tuned?
Test and tune regularly
Replay past incidents or synthetically spike error rates in a lower environment to confirm burn alerts fire when expected.
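For example, a pytest-style replay of synthetic spikes against the burn helpers sketched earlier (assumed here to live in a hypothetical `burn_rules` module):

```python
from burn_rules import should_page, FAST_BURN, SLOW_BURN  # the sketch above, assumed module

def test_fast_burn_fires_on_sudden_spike():
    # 2% errors against a 99.9% SLO is a 20x burn, above the 14x fast threshold.
    assert should_page(error_ratio=0.02, slo_target=0.999, rule=FAST_BURN)

def test_slow_burn_fires_on_creeping_errors():
    # A steady 0.4% error ratio burns at 4x, above the 3x slow threshold.
    assert should_page(error_ratio=0.004, slo_target=0.999, rule=SLOW_BURN)

def test_no_page_on_healthy_traffic():
    # Well under budget: 0.05% errors is only a 0.5x burn.
    assert not should_page(error_ratio=0.0005, slo_target=0.999, rule=FAST_BURN)
    assert not should_page(error_ratio=0.0005, slo_target=0.999, rule=SLOW_BURN)
```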
Calibrate thresholds by product area; customer signup SLOs deserve tighter, faster-firing burn thresholds than internal reporting dashboards. Document why each burn-rate pair exists so newcomers trust it.
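One way to keep that documentation next to the numbers is a per-area policy map; the areas, windows, and multiples below are illustrative only:

```python
# Illustrative per-area burn pairs; the numbers and area names are examples,
# not recommendations. The "why" field is what newcomers will read first.
BURN_POLICIES = {
    "customer-signup": {
        "fast": {"window_min": 5, "threshold": 14.0},
        "slow": {"window_min": 60, "threshold": 3.0},
        "why": "Direct revenue path; a short outage loses customers for good.",
    },
    "internal-reporting": {
        "fast": {"window_min": 30, "threshold": 10.0},
        "slow": {"window_min": 360, "threshold": 2.0},
        "why": "Tolerates hours of degradation; pages only for sustained burn.",
    },
}
```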
