
Database Failover Runbooks: Keeping Operations Smooth During Downtime

Learn how to execute a safe database failover without data loss. Discover how to use Watch.dog to verify that your standby node is ready before you promote it to master.

By Watch Dog Team | Published March 5, 2025 | 13 min read

The Split-Brain Danger

Symptom Log

failover_logic.sh

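# Step 1: Verify Master health (HEAD request; a 503 means the node is unhealthy)
curl -I http://master-node/health # Result: 503
# Step 2: Forcefully shut down the Master (STONITH) so it can no longer accept writes
# Step 3: Verify Standby replication lag is < 1s before promoting it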

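In a failover, the most dangerous scenario is 'Split-Brain', where two nodes both believe they are the Master and both accept writes. The two copies of the data then diverge, and the conflicting writes often cannot be reconciled, which can mean permanent data loss or corruption.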

Your failover runbook must be strictly sequential, and your monitoring must be able to verify that the 'Dead' node is truly down (fenced, not just unreachable from one vantage point) before the 'Standby' is promoted.
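The ordering described above can be sketched as a small shell gate. This is a minimal illustration, not Watch.dog functionality: the function name `failover_decision`, its three inputs, and the 1000 ms threshold (taken from the "< 1s" lag target in the script above) are assumptions. In a real runbook the inputs would come from your own health check, fencing tool, and replication status query.

```shell
# Sketch only: failover_decision and its arguments are hypothetical.
# $1 = HTTP status code from the Master health check (e.g. 200 or 503)
# $2 = "yes" if fencing (STONITH) of the Master has been confirmed
# $3 = Standby replication lag in milliseconds
# Prints the decision; returns 0 for "promote", 1 for any abort.
failover_decision() {
  local http_code="$1" fenced="$2" lag_ms="$3"
  local max_lag_ms=1000  # assumed threshold, matching the "< 1s" target

  # Step 1: never fail over while the Master still reports healthy.
  if [ "$http_code" = "200" ]; then
    echo "abort: master is healthy"
    return 1
  fi

  # Step 2: the Master must be fenced before anything is promoted.
  if [ "$fenced" != "yes" ]; then
    echo "abort: master not fenced"
    return 1
  fi

  # Step 3: the Standby must be caught up, or promotion loses writes.
  if [ "$lag_ms" -ge "$max_lag_ms" ]; then
    echo "abort: standby lag too high"
    return 1
  fi

  echo "promote"
}
```

For example, `failover_decision 503 yes 200` prints `promote`, while `failover_decision 503 no 200` prints `abort: master not fenced`. Each step returns early, so a later check never runs unless every earlier one has passed.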

Verified Promotion

Configure Watch.dog Dual Monitors: one to track the Master's death, and a second to verify that the Standby has taken over the floating IP and is responding to queries.
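One way to reason about the two monitor results together is a small lookup, sketched below. It does not call any Watch.dog API. The function name `verify_promotion` and the "up"/"down" inputs are hypothetical stand-ins for whatever your two monitors report once a promotion has been triggered.

```shell
# Sketch only: verify_promotion and its inputs are hypothetical.
# $1 = "up" or "down": monitor probing the old Master directly
# $2 = "up" or "down": monitor querying the database via the floating IP
# Returns 0 when the promotion is verified, 1 when it is incomplete,
# 2 when the old Master is reachable again, 3 on unexpected input.
verify_promotion() {
  local old_master="$1" floating_ip="$2"
  case "$old_master:$floating_ip" in
    down:up)
      echo "verified"
      ;;
    down:down)
      echo "incomplete: floating IP not answering"
      return 1
      ;;
    up:up|up:down)
      # After a promotion, a reachable old Master may still accept writes.
      echo "danger: old master reachable, split-brain risk"
      return 2
      ;;
    *)
      echo "unknown monitor state"
      return 3
      ;;
  esac
}
```

Only the `down:up` combination counts as a verified promotion. Treating a reachable old Master as the most severe result reflects the split-brain concern: it should be fenced again before anything else is retried.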

Fix Verification

failover_execution.log

[INFO] Watch.dog: Primary DB heartbeat MISSED.
[ACTION] Applying Runbook #44: Promotion to SlaveA.
[INFO] Checking Replication Consistency... OK.
[SUCCESS] SlaveA promoted to Master. All 12 global nodes confirm connectivity.

Zero-Downtime Strategy

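Many modern architectures place a 'Smart Proxy' such as ProxySQL or PgBouncer between the application and the database, so the application keeps connecting to the same endpoint while the backend changes. Watch.dog monitors the proxy's health, which lets you confirm that traffic kept flowing and that the application never noticed the failover.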

Failover Maturity Levels

Type	Manual Effort	Uptime Impact
Manual DNS Change	High (Hours)	Major Outage
Automated Script	Medium (Minutes)	Transient Failure
Orchestrated Failover	Low (Seconds)	Zero / Minimal Impact

In SRE, a disaster is just an unpracticed failover.

Practice your Failovers