Two speeds of alert
Use fast-burn thresholds (for example 14x over 5 minutes) to catch sudden failures, and slow-burn thresholds (2–3x over 1–6 hours) to surface creeping risk before customers pile into support.
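As a rough sketch in Python (the helper names and window objects are illustrative, not tied to any particular monitoring stack), the pairing might look like this, where burn rate is the observed error ratio divided by the error budget the SLO allows:

```python
from dataclasses import dataclass

@dataclass
class BurnWindow:
    """One burn-rate rule: how far over budget, measured over which window."""
    window_minutes: int
    threshold: float  # burn-rate multiple that should page

# Fast burn catches sudden failures; slow burn catches creeping risk.
FAST_BURN = BurnWindow(window_minutes=5, threshold=14.0)
SLOW_BURN = BurnWindow(window_minutes=60, threshold=3.0)

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Burn rate = observed error ratio / error budget allowed by the SLO.
    A 99.9% SLO allows a 0.001 error ratio, so 1.4% errors burns at 14x."""
    allowed = 1.0 - slo_target
    return error_ratio / allowed if allowed > 0 else float("inf")

def should_page(error_ratio: float, slo_target: float, rule: BurnWindow) -> bool:
    """True when the burn rate over the rule's window exceeds its threshold."""
    return burn_rate(error_ratio, slo_target) >= rule.threshold
```

With a 99.9% target, `should_page(0.014, 0.999, FAST_BURN)` returns True: 1.4% errors is exactly a 14x burn.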
Route both to a small primary rotation plus a product sponsor so decisions balance customer impact and business tradeoffs. Make the alert state whether the error budget for the period is already exhausted.
Tie burn alerts to deployment freezes automatically when thresholds trip, and unlock only after mitigation or an explicit override.
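A minimal sketch of that gate, assuming a shared in-process flag rather than any real deployment tooling; the class and method names are made up for illustration:

```python
from datetime import datetime, timezone

class DeployGate:
    """Minimal freeze gate: a tripped burn alert blocks deploys until the burn
    is mitigated or someone records an explicit, attributed override."""

    def __init__(self) -> None:
        self.frozen_since: datetime | None = None
        self.override_by: str | None = None

    def on_burn_alert(self) -> None:
        # Trip the freeze the moment a burn threshold fires.
        self.frozen_since = self.frozen_since or datetime.now(timezone.utc)
        self.override_by = None

    def on_mitigated(self) -> None:
        # Clear the freeze once the burn is confirmed mitigated.
        self.frozen_since = None
        self.override_by = None

    def record_override(self, approver: str) -> None:
        # An explicit override unblocks deploys while the freeze stays recorded.
        self.override_by = approver

    def deploys_allowed(self) -> bool:
        return self.frozen_since is None or self.override_by is not None
```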
Attach runbooks
Include clear rollback steps, feature flag toggles, and links to the dashboards that confirm the mitigation worked. If the alert fires outside business hours, the instructions should be runnable by any on-call engineer.
Log the cause and mitigation of each burn in a simple runbook entry. That history improves forecasting and budget planning, and shows whether the thresholds still fit the system's current shape.
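One possible shape for that entry, sketched as a Python dataclass; every field name here is an assumption about what your history needs, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class BurnLogEntry:
    """One line of burn history: what burned, why, and what fixed it."""
    incident_id: str
    sli: str                      # e.g. "checkout availability"
    peak_burn_rate: float         # worst multiple observed
    budget_consumed_pct: float    # share of the period's budget this burn ate
    cause: str                    # e.g. "bad config push to payment gateway"
    mitigation: str               # e.g. "rolled back the previous release"
    detected_at: datetime
    resolved_at: datetime
    threshold_feedback: str = ""  # note if the burn pair fired too late or too early
```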
Keep the alert payload concise: which SLI is burning, how much budget is left, what changed recently, and when the next escalation happens.
Alert ingredients
- SLI and current vs. target burn rate, with a link to the graph
- Current budget left and projected exhaustion time
- Next escalation time and the channel/people it will page
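A hedged sketch of a payload builder covering those ingredients; the field names and the 30-day period default are assumptions, and projected exhaustion simply extrapolates the current burn rate:

```python
from datetime import datetime, timedelta, timezone

def build_alert_payload(
    sli: str,
    current_burn: float,
    threshold: float,
    budget_remaining_pct: float,
    recent_change: str,
    next_escalation: datetime,
    escalation_channel: str,
    graph_url: str,
    period_days: int = 30,
) -> dict:
    """Assemble the concise payload described above. Projected exhaustion
    assumes the current burn rate holds for the rest of the period."""
    if current_burn > 0:
        # At burn rate B, the remaining budget fraction lasts (fraction * period) / B.
        hours_left = (period_days * 24) * (budget_remaining_pct / 100) / current_burn
        exhaustion = datetime.now(timezone.utc) + timedelta(hours=hours_left)
    else:
        exhaustion = None
    return {
        "sli": sli,
        "burn_rate": f"{current_burn:.1f}x (threshold {threshold:.0f}x)",
        "budget_remaining_pct": budget_remaining_pct,
        "projected_exhaustion": exhaustion.isoformat() if exhaustion else "n/a",
        "recent_change": recent_change,
        "next_escalation": f"{next_escalation.isoformat()} -> {escalation_channel}",
        "graph": graph_url,
    }
```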
Show context
Display burn rate on incident status pages so customers understand whether the issue is contained or still growing.
Pause deploys when burn exceeds a threshold until SLOs recover and health checks have stayed green for a defined window.
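A sketch of that resume condition, assuming health check results arrive as timestamped pass/fail pairs; the 30-minute green window is a placeholder, not a recommendation:

```python
from datetime import datetime, timedelta, timezone

def safe_to_resume(
    burn_rate: float,
    burn_threshold: float,
    health_checks: list[tuple[datetime, bool]],  # (timestamp, passed)
    green_window: timedelta = timedelta(minutes=30),
) -> bool:
    """Resume deploys only when the burn is back under threshold AND every
    health check inside the green window has passed."""
    now = datetime.now(timezone.utc)
    if burn_rate >= burn_threshold:
        return False
    recent = [ok for ts, ok in health_checks if now - ts <= green_window]
    # An empty window proves nothing, so require at least one data point.
    return bool(recent) and all(recent)
```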
Review every burn alert weekly: did it fire for the right reason, was the response fast, and should thresholds be tuned?
Test and tune regularly
Replay past incidents or synthetically spike error rates in a lower environment to confirm burn alerts fire when expected.
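For example, a pytest-style replay of synthetic spikes against the burn helpers sketched earlier (assumed here to live in a hypothetical `burn_rules` module):

```python
from burn_rules import should_page, FAST_BURN, SLOW_BURN  # the sketch above, assumed module

def test_fast_burn_fires_on_sudden_spike():
    # 2% errors against a 99.9% SLO is a 20x burn, above the 14x fast threshold.
    assert should_page(error_ratio=0.02, slo_target=0.999, rule=FAST_BURN)

def test_slow_burn_fires_on_creeping_errors():
    # A steady 0.4% error ratio burns at 4x, above the 3x slow threshold.
    assert should_page(error_ratio=0.004, slo_target=0.999, rule=SLOW_BURN)

def test_no_page_on_healthy_traffic():
    # Well under budget: 0.05% errors is only a 0.5x burn.
    assert not should_page(error_ratio=0.0005, slo_target=0.999, rule=FAST_BURN)
    assert not should_page(error_ratio=0.0005, slo_target=0.999, rule=SLOW_BURN)
```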
Calibrate thresholds by product area; customer signup SLOs deserve tighter, faster-firing burn thresholds than internal reporting dashboards. Document why each burn-rate pair exists so newcomers trust it.
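One way to keep that documentation next to the numbers is a per-area policy map; the areas, windows, and multiples below are illustrative only:

```python
# Illustrative per-area burn pairs; the numbers and area names are examples,
# not recommendations. The "why" field is what newcomers will read first.
BURN_POLICIES = {
    "customer-signup": {
        "fast": {"window_min": 5, "threshold": 14.0},
        "slow": {"window_min": 60, "threshold": 3.0},
        "why": "Direct revenue path; a short outage loses customers for good.",
    },
    "internal-reporting": {
        "fast": {"window_min": 30, "threshold": 10.0},
        "slow": {"window_min": 360, "threshold": 2.0},
        "why": "Tolerates hours of degradation; pages only for sustained burn.",
    },
}
```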
