Map failure states to monitors
List primary, secondary, and manual fallback paths for each tier-one service.
Attach Watch.Dog synthetics to every path and tag them by region and dependency.
Scenarios to rehearse
- Region isolation with DNS or load balancer flips
- Read-only database failover with app feature flags
- Third-party outage with cached responses
Run reversible drills
Automate traffic shifts behind a toggle and alert on the same channels production would use.
Record times from detection to rollback so you know where humans slow recovery.
Score and close gaps
Track drill outcomes in Watch.Dog with owners and next actions.
Add runbook links directly into alerts to shorten investigation during the next switch.
