Probe everything
Use liveness and readiness probes tuned per service to avoid bad rollouts. Readiness should fail fast on dependency issues; liveness should be forgiving enough to avoid kill loops.
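A minimal sketch of that split, assuming an HTTP service with separate /readyz and /healthz endpoints; the names, ports, and timings are illustrative, not prescriptive:

```yaml
# Illustrative probe tuning: readiness reacts quickly to dependency trouble,
# liveness restarts the container only after sustained failure.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api                 # hypothetical service
spec:
  replicas: 3
  selector:
    matchLabels: { app: example-api }
  template:
    metadata:
      labels: { app: example-api }
    spec:
      containers:
        - name: api
          image: registry.example.com/example-api:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:           # gates traffic; fail fast on dependency issues
            httpGet: { path: /readyz, port: 8080 }
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 2     # ~10s to pull the pod out of rotation
          livenessProbe:            # restarts the container; be forgiving
            httpGet: { path: /healthz, port: 8080 }
            initialDelaySeconds: 30
            periodSeconds: 20
            timeoutSeconds: 5
            failureThreshold: 6     # ~2 minutes of failures before a restart
```

The asymmetry is the point: the readiness endpoint checks downstream dependencies, while the liveness endpoint only confirms the process can still serve at all.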
Add synthetic canaries for critical workloads that hit the app through the ingress, not just the pod IP, to catch DNS, cert, and network issues.
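One lightweight way to run such a canary is a CronJob that curls the public hostname through the ingress; the hostname, schedule, and names below are placeholders:

```yaml
# Hypothetical synthetic canary: exercises DNS, TLS, and the ingress path
# every minute by hitting the public URL, not the pod or service IP.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: checkout-canary             # hypothetical name
spec:
  schedule: "* * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      activeDeadlineSeconds: 30
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: probe
              image: curlimages/curl:8.8.0
              # --fail exits non-zero on HTTP errors, so the Job fails visibly.
              command: ["curl", "--fail", "--silent", "--show-error",
                        "--max-time", "10",
                        "https://shop.example.com/healthz"]   # placeholder URL
```

Alert on failed Jobs, or on the canary going silent, rather than on pod health alone.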
Test probes under load and during deployments. Many outages come from probes that pass in staging but flap in real traffic.
Budget disruptions
Use PodDisruptionBudgets and surge settings to keep capacity during updates and autoscaler events. Set budgets based on actual load, not defaults.
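For example, a service measured to need four pods at peak might pair a PDB with surge-based rollouts; the names and numbers here are illustrative:

```yaml
# Keep at least 4 pods serving through drains, upgrades, and scale-down.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-api-pdb
spec:
  minAvailable: 4                   # derived from measured peak load, not a default
  selector:
    matchLabels: { app: example-api }
---
# Roll out by adding capacity first instead of removing it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  replicas: 5
  selector:
    matchLabels: { app: example-api }
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2                   # new pods come up before old ones go away
      maxUnavailable: 0
  template:
    metadata:
      labels: { app: example-api }
    spec:
      containers:
        - name: api
          image: registry.example.com/example-api:1.0.0   # placeholder image
```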
Alert when nodes drop below safe thresholds or when evictions spike. Drain nodes with runbooks that respect PDBs and verify post-drain health.
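A sketch of those alerts, assuming kube-state-metrics and the Prometheus Operator's PrometheusRule CRD are installed; the thresholds and the eviction metric are assumptions to adapt:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-capacity-alerts        # hypothetical name
spec:
  groups:
    - name: node-capacity
      rules:
        - alert: ReadyNodesBelowFloor
          # Fires when fewer than 90% of registered nodes report Ready.
          expr: |
            sum(kube_node_status_condition{condition="Ready",status="true"})
              / count(kube_node_info) < 0.9
          for: 10m
          labels: { severity: page }
        - alert: PodEvictionsSpiking
          # Coarse signal: many pods currently marked Evicted. Assumes a
          # kube-state-metrics version that exposes kube_pod_status_reason.
          expr: sum(kube_pod_status_reason{reason="Evicted"}) > 10
          for: 5m
          labels: { severity: warn }
```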
Cap the autoscaler's total nodes, cores, and memory so a runaway scale-up or cost surprise in one cluster cannot exhaust shared quota and starve the others.
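With the Kubernetes Cluster Autoscaler those ceilings are command-line flags on its Deployment; the values below are placeholders to size from your own quota and budget, and provider-specific flags are omitted:

```yaml
# Fragment of a cluster-autoscaler Deployment showing cluster-wide caps.
# Cloud-provider and node-group discovery flags are left out for brevity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels: { app: cluster-autoscaler }
  template:
    metadata:
      labels: { app: cluster-autoscaler }
    spec:
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
          command:
            - ./cluster-autoscaler
            - --max-nodes-total=60              # hard ceiling across all node groups
            - --cores-total=0:480               # min:max CPU cores the cluster may hold
            - --memory-total=0:1920             # min:max memory, in GiB
            - --scale-down-utilization-threshold=0.5
```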
Cluster guardrails
- PDBs on critical services with tests
- Autoscaler limits with buffer headroom
- Node drain runbooks and checklists
Observe from the edge
Monitor ingress latency, TLS errors, and 429/503s at the edge, not just pod health. Customers feel load balancer and cert issues before pods die.
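A hedged example of edge-level alerting, assuming ingress-nginx and the Prometheus Operator; the metric names are ingress-nginx's, so swap in your controller's equivalents and tune the thresholds:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: edge-alerts                 # hypothetical name
spec:
  groups:
    - name: edge
      rules:
        - alert: EdgeErrorRateHigh
          # Share of 429/5xx responses among all responses at the ingress controller.
          expr: |
            sum(rate(nginx_ingress_controller_requests{status=~"429|5.."}[5m]))
              / sum(rate(nginx_ingress_controller_requests[5m])) > 0.02
          for: 5m
          labels: { severity: page }
        - alert: EdgeLatencyHigh
          # p95 request latency as seen at the ingress, in seconds.
          expr: |
            histogram_quantile(0.95,
              sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le)
            ) > 1
          for: 10m
          labels: { severity: warn }
```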
Test failover paths across regions regularly with traffic shifting. Validate DNS, certificates, and stateful workloads during these drills.
Record which namespaces and ingresses represent revenue paths so those alerts page faster.
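One way to record that is a label convention on the objects themselves; `business-tier: revenue` is a hypothetical key, and the namespace and host are placeholders:

```yaml
# Hypothetical convention marking revenue paths so alert rules and
# Alertmanager routes can page faster for them.
apiVersion: v1
kind: Namespace
metadata:
  name: checkout                    # placeholder namespace
  labels:
    business-tier: revenue
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout
  namespace: checkout
  labels:
    business-tier: revenue
spec:
  rules:
    - host: shop.example.com        # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: checkout
                port:
                  number: 80
```

Alert rules for these objects can carry the same label so routing trees page the revenue tier first.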
Harden configs
Pin Kubernetes versions and roll upgrades gradually with health checks between steps. Back up etcd and test restores regularly.
Use network policies, resource requests/limits, and pod priority classes to keep noisy neighbors from knocking out critical services.
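A compact sketch of all three guardrails; the class name, namespace, and resource numbers are illustrative:

```yaml
# Higher scheduling and eviction priority for critical workloads.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical           # hypothetical class name
value: 100000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Revenue-path services; preempted and evicted last under pressure."
---
# Default-deny ingress in the namespace; allow traffic explicitly per app.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: checkout               # placeholder namespace
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Example pod using the class plus explicit requests and limits.
apiVersion: v1
kind: Pod
metadata:
  name: example-api
  namespace: checkout
spec:
  priorityClassName: business-critical
  containers:
    - name: api
      image: registry.example.com/example-api:1.0.0   # placeholder image
      resources:
        requests: { cpu: 500m, memory: 512Mi }
        limits: { cpu: "1", memory: 1Gi }
```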
