Chaos Engineering Game Days: Breaking Things on Purpose to Improve Uptime
Learn how to organize Game Days for your engineering team. Practice incident response and verify your Watch.dog alerting paths before a real outage happens.
The Fear of the Unknown
```shell
# Simulating a network partition in staging
tc qdisc add dev eth0 root netem loss 50%
# Status: 50% packet loss. Site is crawling.
# Team question: is Watch.dog alerting on latency yet?
```
Most teams are afraid of outages because they have never practiced for them. A 'Game Day' is a scheduled exercise in which you deliberately break a staging (or otherwise non-production) environment to see how your monitoring and your team react.
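During the drill, the team can answer its own question by probing the service and checking whether the p95 latency crosses an alerting threshold. A minimal sketch of that check; the sample latencies and the 800 ms threshold are illustrative assumptions, not Watch.dog defaults:

```python
# Sketch: deciding whether a latency alert *should* fire during the drill.
# The sample data and the 800 ms threshold are illustrative assumptions.
import math

def p95(samples_ms: list[float]) -> float:
    """Return the 95th-percentile latency (nearest-rank method)."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

def should_alert(samples_ms: list[float], threshold_ms: float = 800) -> bool:
    """True when the drill's p95 latency exceeds the paging threshold."""
    return p95(samples_ms) > threshold_ms

# Latencies recorded while 50% packet loss was active (illustrative numbers):
drill_samples = [120, 140, 150, 900, 1100, 130, 950, 1250, 140, 1500]
print(should_alert(drill_samples))  # With these numbers, the drill should page.
```

If this returns `True` but no page arrived, the gap is in the alerting path, not in the monitor's detection.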
A common discovery is that while the monitor detected the failure, the alert was sent to a 'muted' channel or an engineer who was on vacation.
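That class of failure can be caught before the drill by auditing the route itself: is the destination channel unmuted, and is the paged engineer actually on rotation? A minimal sketch; the channel and schedule structures here are hypothetical stand-ins, not Watch.dog's real configuration format:

```python
# Sketch: auditing an alert route before the drill. The channel and on-call
# structures are hypothetical stand-ins, not Watch.dog's real config format.
from datetime import date

channels = {"#incidents": {"muted": False}, "#alerts-archive": {"muted": True}}
on_call = {"alice": {"until": date(2099, 1, 1)}}  # alice is on rotation

def audit_route(channel: str, engineer: str, today: date) -> list[str]:
    """Return a list of routing problems; an empty list means the path is healthy."""
    problems = []
    if channels.get(channel, {}).get("muted", True):
        problems.append(f"{channel} is muted or missing")
    shift = on_call.get(engineer)
    if shift is None or shift["until"] < today:
        problems.append(f"{engineer} is not on rotation today")
    return problems

# Flags both the muted channel and the off-rotation engineer:
print(audit_route("#alerts-archive", "bob", date(2024, 6, 1)))
```

Running an audit like this as part of Game Day prep turns "the alert went to a dead channel" from a mid-incident surprise into a pre-drill fix.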
Fix: Alert Verification
```
[INFO] Chaos test initiated: Terminating DB secondary.
[12:01] Watch.dog: Alert triggered (Critical).
[12:02] Team: Acknowledged via Slack. Applying failover runbook.
[SUCCESS] MTTR during drill: 4 minutes. System stabilized.
```
Building the Runbook
Every Game Day should result in an update to your 'Runbooks'—the step-by-step guides that your team follows during a real incident. Watch.dog's activity logs are the perfect source for these documents.
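Turning those logs into a runbook draft can itself be scripted. A sketch of that conversion; the timestamped log format follows the drill transcript above, and the parsing rules are an assumption:

```python
# Sketch: turning drill activity-log lines into numbered runbook steps.
# The "[HH:MM] Actor: action" format mirrors the drill transcript above.
import re

LOG_LINE = re.compile(r"\[(\d{2}:\d{2})\]\s+(\w[\w.]*)\s*:\s*(.+)")

def to_runbook(log_lines: list[str]) -> list[str]:
    """Convert timestamped log lines into numbered runbook steps."""
    steps = []
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m:
            ts, actor, action = m.groups()
            steps.append(f"{len(steps) + 1}. ({ts}) {actor}: {action}")
    return steps

drill_log = [
    "[12:01] Watch.dog: Alert triggered (Critical).",
    "[12:02] Team: Acknowledged via Slack. Applying failover runbook.",
]
for step in to_runbook(drill_log):
    print(step)
```

The generated steps are a starting draft; the team should still annotate each one with the command or dashboard it corresponds to.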
Chaos Drill Checklist
| Scenario | Expected Result | Actual Watch.dog Insight |
|---|---|---|
| DB Connection Failure | Instant 500 Alert | Handshake Timeout Logged |
| Cloud Provider Outage | Multi-region Alert | Verification via Global Nodes |
| Slow API Performance | P95 Latency Alert | Trace ID injected into Alert |
