Resilience

Chaos Game Days That Prove Your Uptime Defenses

Design reversible experiments to ensure monitors, alerts, and failover actually work.

By Jordan BlakePrincipal Reliability Engineer|Published December 23, 2025|6 min read

Pick safe blast radiuses

Start with dependency toggles or traffic shaping in non-critical paths, not full region kills.

Define clear rollback triggers and pair each experiment with the Watch.Dog monitors it should trip.

Good first chaos drills

Dry run the scenario with on-call so they expect context and alerts.

Force alerts through the same paging path customers will trigger; note where notifications lag or lack context.

The goal is not failure; it is confidence that detection and response are fast enough.

Capture gaps in monitors, dashboards, and runbooks within 24 hours.

Update Watch.Dog tags and routing so future chaos drills page the right owners automatically.

Deploy monitors, share beautiful status pages, and automate incident narratives with Watch Dog.