Map blast radius
List which customer groups rely on each region and which services are global.
Create region-specific status page components so messaging stays accurate.
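A minimal sketch of such a map, assuming a hand-maintained inventory. The group names, region names, and service names below are hypothetical placeholders, not taken from any real deployment:

```python
from collections import defaultdict

# Hypothetical inventory: which regions each customer group depends on.
GROUP_REGIONS = {
    "emea-enterprise": {"eu-west"},
    "us-self-serve": {"us-east", "us-west"},
    "apac-partners": {"ap-south"},
}

# Hypothetical global services: a failure here affects every group.
GLOBAL_SERVICES = {"auth", "billing"}


def groups_by_region(group_regions):
    """Invert the inventory so each region lists the groups that rely on it."""
    inverted = defaultdict(set)
    for group, regions in group_regions.items():
        for region in regions:
            inverted[region].add(group)
    return dict(inverted)


def blast_radius(failed, group_regions, global_services):
    """Return the customer groups affected when a region or service fails."""
    if failed in global_services:
        # A global service outage reaches every customer group.
        return set(group_regions)
    return groups_by_region(group_regions).get(failed, set())
```

The inverted map from groups_by_region is also a reasonable starting point for deciding which region-specific status page components to create.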
Design decisions
- Active-active vs. active-passive
- Database replication lag budgets
- Per-region maintenance windows
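One way to make these decisions checkable is to record them per region as data. The sketch below is illustrative only: the RegionPolicy fields, the budget expressed in seconds, and the window expressed as UTC hours are all assumptions, not a prescribed schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionPolicy:
    """Hypothetical per-region record of the design decisions above."""
    name: str
    mode: str                      # "active-active" or "active-passive"
    lag_budget_seconds: float      # maximum tolerated replication lag
    maintenance_window_utc: tuple  # (start_hour, end_hour), end exclusive


def within_lag_budget(policy, observed_lag_seconds):
    """True if the observed replication lag fits the region's budget."""
    return observed_lag_seconds <= policy.lag_budget_seconds


def in_maintenance_window(policy, hour_utc):
    """True if the given UTC hour falls inside the region's window."""
    start, end = policy.maintenance_window_utc
    if start <= end:
        return start <= hour_utc < end
    # The window wraps past midnight, e.g. (22, 2).
    return hour_utc >= start or hour_utc < end
```

Keeping the budget and window next to the replication mode makes it easier to notice when, say, a passive region's lag budget is too loose for it to be promoted safely.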
Health checks and routing
Use DNS health checks with short TTLs and keep synthetic probes per region.
Automate status page updates when routing policy shifts.
Never route all of a region's alerts into a single shared channel; split them by owning team and on-call rotation.
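The automation step can be sketched as a pure function that turns per-region probe results into status changes. The status labels, the 0.5 threshold, and the shape of the probe data are assumptions; publishing the changes to an actual status page provider is left out because it depends on that provider's API:

```python
def component_status(probe_results, outage_threshold=0.5):
    """Map a region's synthetic probe results (list of bools) to a label."""
    if not probe_results:
        return "unknown"
    failure_ratio = probe_results.count(False) / len(probe_results)
    if failure_ratio == 0:
        return "operational"
    if failure_ratio < outage_threshold:
        return "degraded"
    return "outage"


def status_updates(previous, probes_by_region):
    """Return {region: new_status} for regions whose status changed."""
    updates = {}
    for region, results in probes_by_region.items():
        new_status = component_status(results)
        if previous.get(region) != new_status:
            updates[region] = new_status
    return updates
```

Returning only the regions that changed keeps the status page quiet while routing is stable and avoids reposting identical messages on every probe cycle.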
Practice failovers
Run quarterly game days that simulate the loss of a region, and record the mean time to recovery (MTTR) for each.
Export the results to SLA reports as evidence of resilience.
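MTTR here is the average of the recovery durations. A minimal sketch, assuming each exercise is logged as a (started, recovered) pair of datetimes; that record shape is an assumption, not an established format:

```python
from datetime import timedelta


def mean_time_to_recovery(incidents):
    """Average recovery duration over (started, recovered) datetime pairs."""
    if not incidents:
        raise ValueError("no incidents recorded")
    total = sum(
        (recovered - started for started, recovered in incidents),
        timedelta(),
    )
    return total / len(incidents)
```

The function returns a timedelta, so an SLA report can format it in whatever unit it already uses.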
