
Database Failover Runbooks: Keeping Operations Smooth During Downtime

Learn how to execute a safe database failover without data loss. Discover how to use Watch.dog to verify that your standby node is ready before you promote it to master.

By Watch Dog Team | Published March 5, 2025 | 13 min read

The Split-Brain Danger

Symptom Log
failover_logic.sh
# Step 1: Verify Master health
curl -I http://master-node/health # Result: 503
# Step 2: Shut down Master forcefully (STONITH)
# Step 3: Verify Standby replication lag: < 1s.

In a failover, the most dangerous scenario is 'split-brain', where two nodes both believe they are the Master and accept writes at the same time. The two copies of the data then diverge, and the conflicting writes often cannot be reconciled without losing some of them.

Your failover runbook must be strictly sequential, and your monitoring must be able to verify that the 'Dead' node is truly unreachable before promoting the 'Standby'.
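The ordering described above can be sketched as a small gate function that refuses to promote unless every earlier step has passed. This is a minimal sketch, not a definitive implementation: the function name `promotion_allowed`, its arguments, and the messages are hypothetical, and the 1000 ms limit simply mirrors the "< 1s" lag check in failover_logic.sh.

```shell
#!/usr/bin/env bash
# Hypothetical promotion gate: every name here is illustrative.
#
# promotion_allowed MASTER_HTTP_STATUS FENCE_CONFIRMED LAG_MS
#   MASTER_HTTP_STATUS  status code of the Master health check ("000" = no response)
#   FENCE_CONFIRMED     "yes" only if fencing (STONITH) of the old Master was confirmed
#   LAG_MS              Standby replication lag in milliseconds
# Prints the decision; returns 0 only when promotion is safe.
promotion_allowed() {
  local status="$1" fenced="$2" lag_ms="$3"
  local max_lag_ms=1000   # assumption: mirrors the "< 1s" threshold

  # Step 1: a healthy Master means there is nothing to fail over.
  if [ "$status" = "200" ]; then
    echo "abort: master is healthy"
    return 1
  fi

  # Step 2: never promote while the old Master might still accept writes.
  if [ "$fenced" != "yes" ]; then
    echo "abort: old master not fenced"
    return 1
  fi

  # Step 3: refuse to decide on a missing or non-numeric lag reading.
  case "$lag_ms" in
    ''|*[!0-9]*)
      echo "abort: replication lag unknown"
      return 1
      ;;
  esac
  if [ "$lag_ms" -ge "$max_lag_ms" ]; then
    echo "abort: replication lag too high"
    return 1
  fi

  echo "promote"
}
```

The status argument could come from something like `curl -s -o /dev/null -w '%{http_code}' http://master-node/health`, which prints `000` when the node does not answer at all. Keeping the decision separate from the probes makes the runbook logic easy to rehearse without touching a real database.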

Verified Promotion
Configure Watch.dog Dual Monitors: one monitor to track the Master's death, and a second to verify that the Standby has taken over the floating IP and is responding to queries.
Fix Verification
failover_execution.log
[INFO] Watch.dog: Primary DB heartbeat MISSED.
[ACTION] Applying Runbook #44: Promotion to SlaveA.
[INFO] Checking Replication Consistency... OK.
[SUCCESS] SlaveA promoted to Master. All 12 global nodes confirm connectivity.
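One way to read the two monitor results together is as a four-state table. The sketch below is an illustration only: `failover_state` and its state names are hypothetical and are not part of any Watch.dog API. It assumes each monitor result has already been reduced to "up" or "down".

```shell
#!/usr/bin/env bash
# Hypothetical helper: combines two independent monitor results.
#
# failover_state PRIMARY STANDBY_TAKEOVER
#   PRIMARY           "up" if the old Master still answers its health check
#   STANDBY_TAKEOVER  "up" if the Standby holds the floating IP and answers queries
failover_state() {
  local primary="$1" standby="$2"
  case "$primary:$standby" in
    up:down)   echo "normal" ;;             # Master serving, Standby waiting
    down:up)   echo "failover-complete" ;;  # the outcome the runbook aims for
    up:up)     echo "split-brain-risk" ;;   # both nodes may be accepting writes
    down:down) echo "outage" ;;             # nobody is serving queries
    *)         echo "unknown"; return 2 ;;  # bad input: treat as needing a human
  esac
}
```

The `up:up` case is the one that a single monitor cannot detect, which is the argument for running both checks.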

Zero-Downtime Strategy

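Modern architectures often place a 'Smart Proxy' such as ProxySQL or PgBouncer between the application and the database, so that clients keep a single connection endpoint while the backend behind it changes. Watch.dog monitors the proxy's health, so you can confirm that the application keeps reaching the database through a failover.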
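During a failover drill, it can help to measure how long the proxy endpoint is actually unavailable from the client's point of view. The sketch below is a hypothetical helper, `wait_until_healthy`, that re-runs any check command once per second and reports how many attempts failed. The check command itself is left to the caller and is an assumption in every example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: names and the one-second interval are illustrative.
#
# wait_until_healthy MAX_FAILURES COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or has failed MAX_FAILURES times.
# Prints the number of failed attempts, a rough outage length in seconds.
# COMMAND should be quiet on stdout so that the count is the only output.
wait_until_healthy() {
  local max_failures="$1"
  shift
  local failures=0
  until "$@"; do
    failures=$((failures + 1))
    if [ "$failures" -ge "$max_failures" ]; then
      echo "$failures"
      return 1   # gave up: endpoint never recovered
    fi
    sleep 1
  done
  echo "$failures"
}
```

If the proxy speaks the PostgreSQL protocol, a call such as `wait_until_healthy 30 pg_isready -q -h db-proxy -p 6432` might serve as the check, where `db-proxy` and the port are placeholders for your own endpoint. A result of `0` means the check never failed during the drill.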

Failover Maturity Levels

| Type | Manual Effort | Uptime Impact |
| --- | --- | --- |
| Manual DNS Change | High (Hours) | Major Outage |
| Automated Script | Medium (Minutes) | Transient Failure |
| Orchestrated Failover | Low (Seconds) | Zero / Minimal Impact |

In SRE, a disaster is just an unpracticed failover.

Practice your Failovers

Don't find out your standby node is broken during a real crisis. Start monitoring your replication health with Watch.dog.