Pick safe candidates
Automate fixes you already run manually: service restarts, cache flushes, or pod replacements.
Require prechecks and postchecks with Watch.Dog synthetics before declaring success.
Great automation is boring, reversible, and observable.
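One way to picture the precheck, action, postcheck sequence is the sketch below. It is a minimal illustration, not a prescribed implementation: the Remediation class and run function are hypothetical names, and the checks are plain callables standing in for whatever Watch.Dog synthetic you use, since the Watch.Dog API is not described here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Remediation:
    """A single automated fix with its safety checks (hypothetical structure)."""
    name: str
    precheck: Callable[[], bool]   # True if the fix is needed and safe to run
    action: Callable[[], None]     # the fix itself, e.g. a service restart
    rollback: Callable[[], None]   # undoes the action if recovery is not confirmed
    postcheck: Callable[[], bool]  # True if the system recovered


def run(remediation: Remediation) -> str:
    """Run one remediation and report what happened."""
    if not remediation.precheck():
        return "skipped"
    remediation.action()
    if remediation.postcheck():
        return "succeeded"
    # Success is never declared without a passing postcheck.
    remediation.rollback()
    return "rolled_back"


# Example wiring with stand-in checks (assumed; replace with real probes).
state = {"healthy": False}
restart = Remediation(
    name="restart-service",
    precheck=lambda: not state["healthy"],
    action=lambda: state.update(healthy=True),
    rollback=lambda: state.update(healthy=False),
    postcheck=lambda: state["healthy"],
)
print(run(restart))  # prints "succeeded"
```

Keeping the rollback next to the action makes reversibility part of the definition of each automation rather than an afterthought.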
Build guardrails
Limit concurrency, add circuit breakers, and page humans when retries exceed thresholds.
Log every action together with the incident that triggered it so audits are fast.
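The guardrails above could be combined roughly as follows. This is a sketch under stated assumptions: the Guardrail class, its parameters, and the page callback are hypothetical, and a real setup would call your paging tool and persist the failure count outside the process.

```python
import logging
import threading
from typing import Callable

log = logging.getLogger("remediation")


class Guardrail:
    """Concurrency limit plus a simple circuit breaker (hypothetical helper)."""

    def __init__(self, max_concurrent: int, max_failures: int,
                 page: Callable[[str], None]) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._max_failures = max_failures
        self._failures = 0  # consecutive failures so far
        self._lock = threading.Lock()
        self._page = page   # assumed hook into your paging system

    def is_open(self) -> bool:
        """True once consecutive failures reach the threshold."""
        with self._lock:
            return self._failures >= self._max_failures

    def attempt(self, incident_id: str, action_name: str,
                action: Callable[[], bool]) -> str:
        if self.is_open():
            return "circuit_open"  # automation stays halted until a human resets it
        if not self._slots.acquire(blocking=False):
            return "throttled"     # too many remediations already running
        try:
            ok = action()
        except Exception:
            log.exception("incident=%s action=%s raised", incident_id, action_name)
            ok = False
        finally:
            self._slots.release()
        # Every attempt is logged with the incident it belongs to.
        log.info("incident=%s action=%s ok=%s", incident_id, action_name, ok)
        with self._lock:
            if ok:
                self._failures = 0
                return "succeeded"
            self._failures += 1
            tripped = self._failures == self._max_failures
        if tripped:
            self._page(f"{action_name} failed {self._max_failures} times in a row "
                       f"(last incident {incident_id}); automation halted")
        return "failed"
```

Here a success resets the failure count, so only consecutive failures trip the breaker. That is a design choice; a time-windowed count may suit noisy systems better.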
Measure the impact
Track the improvement in mean time to recovery (MTTR) and the error budget saved by each automation.
Retire automations that no longer match the architecture or failure modes.
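The two measurements can be computed with simple arithmetic. The sketch below uses hypothetical function names and assumes that every minute of recovery time saved is a full minute of SLO-violating downtime avoided, which overstates the saving for partial outages.

```python
from statistics import mean
from typing import Sequence


def mttr_improvement(manual_minutes: Sequence[float],
                     automated_minutes: Sequence[float]) -> float:
    """Minutes of recovery time saved per incident, on average."""
    return mean(manual_minutes) - mean(automated_minutes)


def budget_minutes(slo: float, period_minutes: float) -> float:
    """Allowed downtime for an availability SLO over a period."""
    return (1 - slo) * period_minutes


def error_budget_saved(saved_per_incident: float, incidents: int,
                       budget: float) -> float:
    """Fraction of the period's error budget the automation preserved.

    Assumes each saved minute is a full minute of avoided downtime.
    """
    return saved_per_incident * incidents / budget
```

For example, if manual recoveries took 30, 40, and 50 minutes and automated ones took 5, 10, and 15, the improvement is 30 minutes per incident. A 99.9% SLO over 30 days (43,200 minutes) allows about 43.2 minutes of downtime, so a single such incident handled by automation preserves roughly 69% of that month's budget.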
