Database Failover Runbooks: Keeping Operations Smooth During Downtime
Learn how to execute a safe database failover without data loss. Discover how to use Watch.dog to verify that your standby node is ready before you promote it to master.
The Split-Brain Danger
# Step 1: Verify Master health
curl -I http://master-node/health # Result: 503
# Step 2: Shutdown Master forcefully (STONITH)
# Step 3: Verify Standby replication lag: < 1s.
In a failover, the most dangerous scenario is 'Split-Brain', where two nodes believe they are the Master and start accepting writes. This leads to permanent data corruption.
Your failover runbook must be strictly sequential, and your monitoring must be able to verify that the 'Dead' node is truly unreachable before promoting the 'Standby'.
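The ordering described above can be sketched in a few lines of Python. This is a minimal illustration, not Watch.dog's implementation: `FailoverSteps`, `run_failover`, and every hook name are hypothetical, and each hook is assumed to wrap your own tooling (health probe, STONITH command, replication query, promotion command). The property it shows is that promotion is only reachable after fencing and the unreachability check have both succeeded.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class FailoverAborted(Exception):
    """Raised when a precondition fails; the standby is NOT promoted."""


@dataclass
class FailoverSteps:
    # All callables are hypothetical hooks; wire them to your own tooling.
    primary_is_healthy: Callable[[], bool]
    fence_primary: Callable[[], None]           # e.g. a STONITH / power-off command
    primary_is_unreachable: Callable[[], bool]  # ideally checked from several locations
    standby_lag_seconds: Callable[[], float]
    promote_standby: Callable[[], None]
    max_lag_seconds: float = 1.0                # the "< 1s" threshold from the runbook
    log: List[str] = field(default_factory=list)


def run_failover(steps: FailoverSteps) -> bool:
    """Run the steps strictly in order.

    Returns True if the standby was promoted, False if no failover was needed.
    Raises FailoverAborted if any safety check fails.
    """
    # Step 1: only proceed if the primary really is unhealthy.
    if steps.primary_is_healthy():
        steps.log.append("primary healthy, no failover")
        return False

    # Step 2: fence the old primary, then confirm it is gone.
    steps.fence_primary()
    steps.log.append("primary fenced")
    if not steps.primary_is_unreachable():
        raise FailoverAborted("primary still reachable after fencing")
    steps.log.append("primary confirmed unreachable")

    # Step 3: refuse to promote a standby that is too far behind.
    lag = steps.standby_lag_seconds()
    if lag >= steps.max_lag_seconds:
        raise FailoverAborted(f"standby lag {lag:.2f}s is not below the limit")
    steps.log.append("standby lag within limit")

    # Only now is promotion allowed.
    steps.promote_standby()
    steps.log.append("standby promoted")
    return True
```

Raising an exception instead of continuing is a deliberate choice in this sketch: an aborted failover leaves the service down, but a promotion while the old Master is still reachable is exactly the split-brain case.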
Verified Promotion
[INFO] Watch.dog: Primary DB heartbeat MISSED.
[ACTION] Applying Runbook #44: Promotion to SlaveA.
[INFO] Checking Replication Consistency... OK.
[SUCCESS] SlaveA promoted to Master. All 12 global nodes confirm connectivity.
Zero-Downtime Strategy
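Modern architectures place a 'Smart Proxy' such as ProxySQL or PgBouncer between the application and the database. Because the application connects to the proxy rather than to a database node directly, traffic can be switched to the new Master during a failover without the application noticing. Watch.dog monitors the proxy's health.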
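Even with a proxy in front of the database, connections may fail briefly while traffic is switched to the new node. One common way to ride through that window is a short retry with backoff on the application side. The sketch below is an assumption about how an application might do this, not something the proxy or Watch.dog provides: `call_with_retry` is a hypothetical helper, and the built-in `ConnectionError` stands in for whatever exception your database driver actually raises.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` on ConnectionError with exponential backoff.

    `operation` is any zero-argument callable, e.g. a function that runs
    one query through the proxy. `sleep` is injectable so tests run fast.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error to the caller
            # Wait 0.1s, 0.2s, 0.4s, ... before trying again.
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

Retrying is only safe for operations that are idempotent or wrapped in a transaction that was never committed. A write that may already have been applied should not be retried blindly.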
Failover Maturity Levels
| Type | Manual Effort | Uptime Impact |
|---|---|---|
| Manual DNS Change | High (Hours) | Major Outage |
| Automated Script | Medium (Minutes) | Transient Failure |
| Orchestrated Failover | Low (Seconds) | Zero / Minimal Impact |
