Observability

Chaos Engineering Game Days: Breaking Things on Purpose to Improve Uptime

Learn how to organize Game Days for your engineering team. Practice incident response and verify your Watch.dog alerting paths before a real outage happens.

By Watch Dog Team · Published April 5, 2026 · 12 min read

The Fear of the Unknown

Symptom Log
chaos_drill.sh
# Simulating a network partition in staging (requires root)
tc qdisc add dev eth0 root netem loss 50%
# Status: 50% packet loss. Site is crawl-speed.
# Team Question: Is Watch.dog alerting on latency yet?
# Cleanup when the drill ends:
# tc qdisc del dev eth0 root netem

Most teams are afraid of outages because they have never practiced for them. A 'Game Day' is a scheduled exercise where you intentionally break a staging (or other non-production) environment to see how your monitoring and your team respond.
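The drill above can be wrapped in a small harness so the fault is always removed, even if the exercise is aborted. This is a hypothetical sketch: the `run`/`cleanup` helpers and the `DRY_RUN` switch are our own conventions (real `tc` changes need root and a real interface, so the default here only prints the commands).

```shell
# Hypothetical Game Day harness: time-box a fault injection with guaranteed cleanup.
# DRY_RUN=1 (the default) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
IFACE="${IFACE:-eth0}"

run() {
  # Execute the command, or echo it in dry-run mode.
  if [ "$DRY_RUN" = "1" ]; then echo "DRY RUN: $*"; else "$@"; fi
}

cleanup() {
  # Remove the netem rule no matter how the drill ends.
  run tc qdisc del dev "$IFACE" root netem
}
trap cleanup EXIT INT TERM

# Inject 50% packet loss, then hold the fault for a 5-minute observation window.
run tc qdisc add dev "$IFACE" root netem loss 50%
run sleep 300
```

Running it for real is just `DRY_RUN=0 IFACE=eth0 sh drill.sh` on the staging host; the `trap` guarantees the packet loss is cleared on Ctrl-C as well as on normal exit.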

A common discovery is that while the monitor detected the failure, the alert was sent to a 'muted' channel or an engineer who was on vacation.

Fix: Alert Verification
Use Watch.dog Test Alerts to simulate critical failures twice a month. Verify that your Slack Webhooks, SMS, and Voice calls are actually reaching the right people.
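Independently of the product's built-in test alerts, you can exercise a Slack delivery path yourself with a plain webhook POST. This is a sketch under assumptions: `check_webhook` is a hypothetical helper, and `SLACK_WEBHOOK_URL` stands in for your real incoming-webhook URL; no Watch.dog API is invoked here.

```shell
# Hypothetical helper: POST a drill message to a Slack incoming webhook and
# report whether the alert path is reachable.
check_webhook() {
  url="$1"
  if [ -z "$url" ]; then
    echo "no webhook URL given; nothing sent"
    return 1
  fi
  # Slack returns HTTP 200 on successful webhook delivery.
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    -X POST -H 'Content-Type: application/json' \
    -d '{"text":"[DRILL] Test alert - please acknowledge in-channel"}' \
    "$url")
  if [ "$status" = "200" ]; then
    echo "webhook path OK"
  else
    echo "webhook path FAILED (HTTP $status)"
    return 1
  fi
}

# Uses the SLACK_WEBHOOK_URL environment variable if set; otherwise reports a skip.
check_webhook "${SLACK_WEBHOOK_URL:-}" || true
```

Delivery alone is not the whole check: someone must also confirm the message landed in the right channel, not a muted one.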
Fix Verification
drill_success.log
[INFO] Chaos test initiated: Terminating DB secondary.
[12:01] Watch.dog: Alert triggered (Critical).
[12:02] Team: Acknowledged via Slack. Applying failover runbook.
[SUCCESS] MTTR during drill: 4 minutes. System stabilized.

Building the Runbook

Every Game Day should result in an update to your 'Runbooks'—the step-by-step guides that your team follows during a real incident. Watch.dog's activity logs are the perfect source for these documents.
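One low-effort way to seed a runbook is to transcribe the drill's action log into numbered steps. The log format below is an assumption for illustration; adapt the pattern to however you export your activity history.

```shell
# Sample action log from a drill (hypothetical format: [HH:MM] action).
cat > actions.log <<'EOF'
[12:02] Acknowledged alert in Slack
[12:03] Promoted DB secondary to primary
[12:04] Flushed connection pool
[12:05] Confirmed health checks green
EOF

# Strip the timestamp field and emit a numbered step per action.
{
  echo "Runbook: DB secondary failure"
  awk '{ $1=""; sub(/^ /, ""); printf "%d. %s\n", NR, $0 }' actions.log
} > runbook.md
cat runbook.md
```

The draft still needs a human pass (preconditions, rollback notes, escalation contacts), but it captures the exact order of operations while the drill is fresh.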

Chaos Drill Checklist

| Scenario              | Expected Result    | Actual Watch.dog Insight      |
|-----------------------|--------------------|-------------------------------|
| DB Connection Failure | Instant 500 Alert  | Handshake Timeout Logged      |
| Cloud Provider Outage | Multi-region Alert | Verification via Global Nodes |
| Slow API Performance  | P95 Latency Alert  | Trace ID injected into Alert  |
In SRE, we don't hope for the best; we practice for the worst.

Start your Chaos Training

Don't wait for a disaster. Start practicing your incident response with Watch.dog today.