Resilience
Chaos Game Days That Prove Your Uptime Defenses
Design reversible experiments to ensure monitors, alerts, and failover actually work.
By Jordan BlakePublished December 23, 20256 min read
Pick safe blast radiuses
Start with dependency toggles or traffic shaping in non-critical paths, not full region kills.
Define clear rollback triggers and pair each experiment with the Watch.Dog monitors it should trip.
Good first chaos drills
- Intentionally expire a TLS certificate in staging
- Drop a single dependency host
- Throttle one API shard to 500ms+
Run, observe, and page
Dry run the scenario with on-call so they expect context and alerts.
Force alerts through the same paging path customers will trigger; note where notifications lag or lack context.
The goal is not failure; it is confidence that detection and response are fast enough.
Close the loop fast
Capture gaps in monitors, dashboards, and runbooks within 24 hours.
Update Watch.Dog tags and routing so future chaos drills page the right owners automatically.