Pick safe blast radiuses
Start with dependency toggles or traffic shaping in non-critical paths, not full region kills.
Define clear rollback triggers and pair each experiment with the Watch.Dog monitors it should trip.
Good first chaos drills
- Intentionally expire a TLS certificate in staging
- Drop a single dependency host
- Throttle one API shard to 500ms+
Run, observe, and page
Dry run the scenario with on-call so they expect context and alerts.
Force alerts through the same paging path customers will trigger; note where notifications lag or lack context.
Close the loop fast
Capture gaps in monitors, dashboards, and runbooks within 24 hours.
Update Watch.Dog tags and routing so future chaos drills page the right owners automatically.
