Cover the basics
Check the status code, latency, and TLS certificate expiry of every public endpoint.
Monitor auth token issuance and refresh flows to avoid surprise 401 errors.
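The TLS expiry part of such a probe could look roughly like the sketch below, which uses only the Python standard library. The function names `days_until_expiry` and `fetch_not_after` are hypothetical and chosen for illustration. How many days of warning to alert on is left to the caller.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter string.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  1 00:00:00 2030 GMT".
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now.timestamp()) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job might call `days_until_expiry(fetch_not_after(host), datetime.now(timezone.utc))` for each public host and raise an alert when the result drops below a chosen threshold.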
Checklist
- p95 latency under 500ms
- Token refresh success rate above 99%
- No 429 spikes during deploys
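The checklist can be evaluated mechanically once the raw numbers are collected. The sketch below is one possible reading, not a definitive one. It assumes a nearest-rank p95, and it treats a "429 spike" as the deploy-window 429 rate exceeding a multiple (`spike_factor`, a hypothetical parameter) of the baseline rate. All function and key names are illustrative.

```python
from typing import Dict, Sequence


def p95(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty latency sample."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integers
    return ordered[rank - 1]


def evaluate_checklist(
    latencies_ms: Sequence[float],
    refresh_ok: int,
    refresh_total: int,
    rate_429_deploy: float,
    rate_429_baseline: float,
    spike_factor: float = 2.0,  # assumption: the source does not define "spike"
) -> Dict[str, bool]:
    """Return a pass/fail flag for each checklist item."""
    refresh_rate = refresh_ok / refresh_total if refresh_total else 0.0
    return {
        "p95_latency_under_500ms": p95(latencies_ms) < 500,
        "token_refresh_above_99pct": refresh_rate > 0.99,
        "no_429_spike_during_deploy": rate_429_deploy
        <= spike_factor * rate_429_baseline,
    }
```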
Guard third parties
Create separate monitors for your gateways and for each downstream SaaS dependency so a failure can be attributed to the right layer.
Trigger graceful degradations before customers are blocked.
Alert on rising error ratios and growing retry queues, before they turn into outright downtime.
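One way to act on error ratios and retry queues per dependency is a small classifier that returns an action before the dependency is fully down. This is a sketch under assumed thresholds. `DependencySample`, `classify`, and the default values are hypothetical and would need tuning per service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencySample:
    """One monitoring window for a single gateway or SaaS dependency."""
    name: str
    requests: int
    errors: int
    retry_queue_depth: int


def classify(
    sample: DependencySample,
    warn_ratio: float = 0.02,   # assumed: start degrading at 2% errors
    crit_ratio: float = 0.10,   # assumed: page a human at 10% errors
    max_queue: int = 100,       # assumed: retry backlog that signals trouble
) -> str:
    """Return "ok", "degrade", or "page" for the sampled window."""
    ratio = sample.errors / sample.requests if sample.requests else 0.0
    if ratio >= crit_ratio:
        return "page"
    if ratio >= warn_ratio or sample.retry_queue_depth > max_queue:
        return "degrade"
    return "ok"
```

Keeping one sample per dependency means a "degrade" result names the responsible third party directly, and the caller can switch on a fallback for that dependency alone.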
Bake into delivery
Gate deployments on synthetic health checks and roll back automatically when those checks fail.
Send a draft incident to the status page when an API's error budget is burning quickly.
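Both rules reduce to small decision functions that a pipeline step could call. The sketch below assumes a burn rate defined as the observed error ratio divided by the error budget (1 minus the SLO target). The function names are hypothetical, and the default threshold of 14.4 is a commonly used fast-burn value, not something the text specifies.

```python
from typing import Sequence


def deploy_action(synthetic_results: Sequence[bool]) -> str:
    """Promote only if every synthetic check passed; otherwise roll back.

    An empty result set is treated as a failure, since no evidence of
    health was collected.
    """
    if synthetic_results and all(synthetic_results):
        return "promote"
    return "rollback"


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than budgeted the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_draft_incident(rate: float, threshold: float = 14.4) -> bool:
    """True when the burn rate is high enough to prepare a status-page draft."""
    return rate >= threshold
```

With a 99.9% SLO, a 2% error ratio gives a burn rate of about 20, which crosses the assumed threshold. A 0.1% error ratio gives about 1 and does not.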
