Build healthy rotations
Limit after-hours load with fair rotations and backup coverage. Rotate weekly, and use shadow weeks for new responders so they learn without the pager panic.
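As a rough illustration, a weekly rotation with shadow weeks can be generated from a plain list of responders. The Python sketch below is only one way to model it; the names and the shadow_weeks parameter are illustrative, not tied to any particular paging tool.

from datetime import date, timedelta

def build_rotation(responders, new_responders, start, weeks, shadow_weeks=2):
    """Return (week_start, primary, backup, shadow) rows for a weekly rotation.

    The backup is simply next week's primary; each new responder shadows the
    on-call pair for shadow_weeks weeks before joining the pool.
    """
    schedule = []
    for week in range(weeks):
        primary = responders[week % len(responders)]
        backup = responders[(week + 1) % len(responders)]
        idx = week // shadow_weeks
        shadow = new_responders[idx] if idx < len(new_responders) else None
        schedule.append((start + timedelta(weeks=week), primary, backup, shadow))
    return schedule

for row in build_rotation(["ana", "bo", "chris", "dee"], ["eli"],
                          start=date(2024, 7, 1), weeks=6):
    print(*row)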
Measure responder load, page quality, and response times monthly to keep burnout low. If pages per shift climb, prune noisy monitors before people quit.
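A minimal monthly load report might look like the following, assuming you can export page records with a monitor name, a responder, and an "actionable" flag; those field names and the shift count are assumptions, not a real export format.

from collections import Counter

# Hypothetical page records exported from your paging tool for one month.
pages = [
    {"monitor": "db-cpu-high", "responder": "ana", "actionable": False},
    {"monitor": "checkout-errors", "responder": "bo", "actionable": True},
    {"monitor": "db-cpu-high", "responder": "ana", "actionable": False},
]

shifts_this_month = 8  # e.g. two teams on weekly rotation

per_responder = Counter(p["responder"] for p in pages)
per_monitor = Counter(p["monitor"] for p in pages)
noise = sum(1 for p in pages if not p["actionable"]) / max(len(pages), 1)

print("pages per shift:", len(pages) / shifts_this_month)
print("noise ratio:", round(noise, 2))
print("pruning candidates:", [m for m, n in per_monitor.most_common(3) if n > 1])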
Give teams scheduled no-page hours for deep work; reliability improves when engineers are not constantly interrupted.
Tune paging rules
Route by service ownership and customer impact, not just severity labels. A signup outage should wake growth and infra, while a billing retry issue may only need finance and backend.
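One way to express ownership-based routing is a small lookup keyed by service and customer impact, falling back to severity only when no owner is registered. The team names and mapping structure below are placeholders borrowed from the example above.

# Hypothetical ownership map: which on-call teams an alert should notify,
# keyed by the affected service and whether customers are impacted.
ROUTES = {
    ("signup", True): ["growth-oncall", "infra-oncall"],
    ("billing-retries", False): ["finance-oncall", "backend-oncall"],
}

def route(service, customer_impact, severity):
    teams = ROUTES.get((service, customer_impact))
    if teams:
        return teams
    # Only fall back to severity labels when no owner is registered.
    return ["infra-oncall"] if severity == "critical" else ["backend-oncall"]

assert route("signup", True, "warning") == ["growth-oncall", "infra-oncall"]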
Escalate automatically when alerts go unacknowledged and log handoffs in the incident timeline. Include a checklist of what must be done before handing off a page.
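A sketch of that escalation loop, assuming a fixed chain and an acknowledgement check supplied by the paging tool; the chain, timeout, and checklist items are examples, not a prescribed policy.

import time

ESCALATION_CHAIN = ["primary", "backup", "engineering-manager"]
ACK_TIMEOUT_S = 5 * 60

HANDOFF_CHECKLIST = [
    "current hypothesis and evidence",
    "mitigations already attempted",
    "next planned step and owner",
]

def escalate(page, acked, timeline):
    """Walk the chain until someone acknowledges, logging each handoff."""
    for level in ESCALATION_CHAIN:
        timeline.append((time.time(), f"paged {level} for {page}"))
        # A real system would wait up to ACK_TIMEOUT_S here before moving on.
        if acked(level):
            return level
        timeline.append((time.time(), f"{level} did not ack, escalating"))
    return None

timeline = []
owner = escalate("checkout-errors", acked=lambda level: level == "backup",
                 timeline=timeline)
print(owner, timeline)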
Set quiet hours policies: combine monitors, widen thresholds, or page to chat first during low-risk windows.
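For example, a quiet-hours rule could downgrade non-critical alerts to chat-first delivery during a low-risk window; the window and severity labels below are assumptions you would tune per team.

from datetime import datetime, time

QUIET_HOURS = (time(22, 0), time(7, 0))  # low-risk window, local time

def delivery_for(alert, now=None):
    """Decide whether an alert pages immediately or goes to chat first."""
    now = now or datetime.now()
    start, end = QUIET_HOURS
    in_quiet_hours = now.time() >= start or now.time() < end
    if in_quiet_hours and alert["severity"] != "critical":
        return "chat-first"  # or batch with similar monitors, widen thresholds
    return "page"

print(delivery_for({"severity": "warning"}, datetime(2024, 7, 1, 23, 30)))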
Make runbooks clickable
Include one-click tests and rollback commands in every alert. The payload should surface dashboards, feature flags, and status page links in the first message.
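The payload might look something like this; every URL, flag, and command here is a placeholder for whatever your alerting tool and deploy system actually expose.

alert_payload = {
    "title": "checkout-errors above 2% for 5 minutes",
    "runbook": "https://runbooks.example.com/checkout-errors",
    "dashboard": "https://grafana.example.com/d/checkout",
    "status_page": "https://status.example.com",
    "feature_flags": ["new-payment-flow"],
    "actions": [
        {"label": "Run smoke test", "command": "make smoke-test-checkout"},
        {"label": "Roll back last deploy", "command": "deployctl rollback checkout"},
    ],
}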
Review runbooks after incidents to keep them accurate and short. Remove steps that no one uses and add the investigation paths that actually solved issues.
Track which alerts lacked runbooks. If responders googled instead of clicking, the system needs better documentation.
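Tracking that gap can be as simple as listing monitors that fired without a runbook link attached, assuming alerts carry a runbook field as in the payload sketch above.

def missing_runbooks(alerts):
    """Return monitor names whose alerts fired without a runbook link."""
    return sorted({a["monitor"] for a in alerts if not a.get("runbook")})

alerts = [
    {"monitor": "checkout-errors",
     "runbook": "https://runbooks.example.com/checkout-errors"},
    {"monitor": "db-cpu-high", "runbook": None},
]
print(missing_runbooks(alerts))  # ['db-cpu-high']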
Measure and improve
Watch MTTA (mean time to acknowledge), MTTR (mean time to resolve), and page quality score (true positives vs. noise). Celebrate teams that delete monitors or automate fixes after noisy weeks.
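All three roll up from per-incident records; the sketch below assumes timestamps are already converted to minutes since the page fired and that someone marked each page actionable or not.

from statistics import mean

# Hypothetical incident records, times in minutes since the page fired.
incidents = [
    {"acked_after": 3, "resolved_after": 42, "actionable": True},
    {"acked_after": 15, "resolved_after": 120, "actionable": True},
    {"acked_after": 2, "resolved_after": 5, "actionable": False},
]

mtta = mean(i["acked_after"] for i in incidents)
mttr = mean(i["resolved_after"] for i in incidents)
quality = sum(i["actionable"] for i in incidents) / len(incidents)

print(f"MTTA {mtta:.1f} min, MTTR {mttr:.1f} min, page quality {quality:.0%}")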
Run quarterly incident drills that include paging. Time how long it takes to gather the right people, join the war room, and publish the first customer update.
