Inventory and rank vendors
Start with an inventory that maps each user journey to the exact vendors it touches, plus the owning team and escalation path. Include SDKs, auth flows, billing calls, and the background jobs that retry or fan out across providers.
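A sketch of what that inventory can look like as structured data, assuming hypothetical journeys, vendors, and team names:

```python
# Hypothetical inventory: every name below is a placeholder, not a real service.
VENDOR_INVENTORY = {
    "checkout": {
        "vendors": ["payments-api", "tax-calculator", "fraud-scoring"],
        "owner": "payments-team",
        "escalation": "#payments-oncall",
        "touchpoints": ["sdk", "billing-call", "webhook", "retry-job"],
    },
    "signup": {
        "vendors": ["email-provider", "identity-provider"],
        "owner": "growth-team",
        "escalation": "#growth-oncall",
        "touchpoints": ["auth-flow", "background-fanout-job"],
    },
}

def vendors_for_journey(journey: str) -> list[str]:
    """Return the third-party vendors a given user journey touches."""
    return VENDOR_INVENTORY.get(journey, {}).get("vendors", [])
```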
Score every dependency on customer blast radius, ease of replacement, and published SLA/OLA. Capture what "degraded" vs. "down" actually looks like for you—rate limiting, partial data, bad cache fills, stale auth tokens—and the time to mitigate each.
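One way to turn those criteria into a comparable risk rank; the 1-to-5 scales and weights are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass
class DependencyScore:
    blast_radius: int            # 1-5: how many customers a failure touches
    replacement_difficulty: int  # 1-5: how hard a swap or fallback is
    sla_gap: int                 # 1-5: gap between the published SLA and what you need
    time_to_mitigate_min: int    # observed or estimated minutes to mitigate

def risk_rank(score: DependencyScore) -> int:
    """Higher rank means the vendor deserves closer monitoring and a warmer fallback."""
    mitigation_band = min(5, max(1, score.time_to_mitigate_min // 15))
    return (3 * score.blast_radius          # weight customer impact highest
            + 2 * score.replacement_difficulty
            + score.sla_gap
            + mitigation_band)
```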
Document warm stand-ins or exit clauses next to each vendor. If a provider is revenue-blocking, it needs a pre-agreed fallback UX, cached responses, and a named on-call owner who can make the switch.
Monitor vendors like your own services
Deploy synthetics that mirror the payloads and headers you actually send, not hello-world pings. Watch rate-limit headers, auth flows, pagination, and critical edge cases like idempotency keys or webhook validation.
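A rough sketch of such a synthetic in Python, assuming a hypothetical vendor endpoint, a test-mode payload, and the conventional X-RateLimit-* headers:

```python
import os
import uuid
import requests

# Placeholder endpoint; point this at the same API and route production uses.
VENDOR_URL = "https://api.example-vendor.com/v1/charges"

def run_synthetic_check(timeout_s: float = 5.0) -> dict:
    """Send a production-shaped request and capture the signals you alert on."""
    headers = {
        "Authorization": f"Bearer {os.environ.get('VENDOR_API_TOKEN', '')}",
        "Idempotency-Key": str(uuid.uuid4()),  # exercise idempotency handling
        "Content-Type": "application/json",
    }
    payload = {"amount": 100, "currency": "usd", "capture": False}  # test-mode shape
    resp = requests.post(VENDOR_URL, json=payload, headers=headers, timeout=timeout_s)
    return {
        "status": resp.status_code,
        "latency_ms": resp.elapsed.total_seconds() * 1000,
        "rate_limit_remaining": resp.headers.get("X-RateLimit-Remaining"),
        "body_has_expected_id": resp.ok and "id" in resp.json(),
    }
```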
Add DNS, TLS, and endpoint reachability monitors for every vendor edge you depend on. Alert on contract drift: latency above their SLA, error codes outside their documentation, or payload shape changes.
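A contract-drift check can be as simple as comparing each synthetic result against what the vendor publishes; the error codes, SLA latency, and expected fields below are placeholders you would lift from their docs and your contract:

```python
DOCUMENTED_ERROR_CODES = {400, 401, 402, 404, 409, 429, 500}  # from the vendor's docs
SLA_LATENCY_MS = 800                                          # from your contract
EXPECTED_FIELDS = {"id", "status", "created"}                 # payload shape you rely on

def detect_contract_drift(status: int, latency_ms: float, body: dict) -> list[str]:
    """Return human-readable drift findings for one synthetic check."""
    findings = []
    if latency_ms > SLA_LATENCY_MS:
        findings.append(f"latency {latency_ms:.0f}ms exceeds SLA {SLA_LATENCY_MS}ms")
    if status >= 400 and status not in DOCUMENTED_ERROR_CODES:
        findings.append(f"undocumented error code {status}")
    missing = EXPECTED_FIELDS - body.keys()
    if missing:
        findings.append(f"payload shape drift, missing fields: {sorted(missing)}")
    return findings
```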
Scrape vendor status pages and RSS feeds into Watch.Dog and map them to your monitors. That way you can auto-suppress duplicate noise, annotate incidents with upstream context, and decide faster whether to fail over.
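If you poll the feeds yourself before pushing events into Watch.Dog, the ingestion step might look like this sketch, which assumes the feedparser library plus hypothetical feed URLs and monitor names:

```python
import feedparser  # widely used RSS/Atom parser

STATUS_FEEDS = {"example-vendor": "https://status.example-vendor.com/history.rss"}
MONITOR_MAP = {"example-vendor": ["checkout-api-contract", "checkout-dns"]}

def ingest_status_feeds(seen_ids: set[str]) -> list[dict]:
    """Pull new status-page entries and map them to the monitors they affect."""
    events = []
    for vendor, feed_url in STATUS_FEEDS.items():
        for entry in feedparser.parse(feed_url).entries:
            entry_id = entry.get("id", entry.get("link", entry.get("title", "")))
            if entry_id in seen_ids:
                continue  # deduplicate across polling runs
            seen_ids.add(entry_id)
            events.append({
                "vendor": vendor,
                "title": entry.get("title", ""),
                "link": entry.get("link", ""),
                # Downstream, these monitors get annotated and optionally
                # auto-suppressed while the upstream incident stays open.
                "monitors": MONITOR_MAP.get(vendor, []),
            })
    return events
```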
Monitoring signals to add
- API contract check with latency and content validation
- Status page or RSS watcher with deduplicated alerts and severity mapping
- Heartbeat from your fallback path to prove failover works
- Rate-limit and auth token canary that exercises retries and backoff (sketched below)
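A minimal sketch of that last canary, assuming a hypothetical low-privilege endpoint and the conventional Retry-After and X-RateLimit-Remaining headers:

```python
import time
import requests

# Hypothetical read-only endpoint that exercises the same auth and rate-limit
# machinery as production traffic without side effects.
CANARY_URL = "https://api.example-vendor.com/v1/ping"

def canary_with_backoff(token: str, max_attempts: int = 4) -> dict:
    """Prove that retries, backoff, and token handling behave as expected."""
    for attempt in range(max_attempts):
        resp = requests.get(CANARY_URL,
                            headers={"Authorization": f"Bearer {token}"},
                            timeout=5)
        if resp.status_code == 401:
            return {"ok": False, "reason": "auth token rejected (stale or revoked?)"}
        if resp.status_code == 429:
            # Respect Retry-After when present, otherwise back off exponentially.
            time.sleep(float(resp.headers.get("Retry-After", 2 ** attempt)))
            continue
        return {"ok": resp.ok, "attempts": attempt + 1,
                "rate_limit_remaining": resp.headers.get("X-RateLimit-Remaining")}
    return {"ok": False, "reason": "still rate limited after retries"}
```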
Contain the blast radius
Ship circuit breakers and aggressive caching so your customers see graceful fallbacks instead of timeouts. Default to serving last-known-good results with expiry headers, then refresh in the background once the vendor is healthy.
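A stripped-down illustration of the breaker-plus-cache pattern; a real implementation would add cache TTLs, metrics, and the background refresh, but the shape is the same:

```python
import time

class VendorCircuit:
    """Serve last-known-good data while a vendor is unhealthy, retry after a cooldown."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = 0.0
        self.last_known_good = None  # cached response; refreshed on every success

    def call(self, fetch):
        """fetch is any zero-argument callable that hits the vendor."""
        if self.failures >= self.failure_threshold:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return self.last_known_good  # open: graceful fallback instead of a timeout
            self.failures = self.failure_threshold - 1  # half-open: allow one trial call
        try:
            result = fetch()
            self.last_known_good = result
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return self.last_known_good
```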
Segment impact by customer tier and geography. Auto-pause noisy monitors when failover is active, and prioritize web vs. mobile flows differently if one channel can absorb more latency.
Keep a toggle-driven kill switch for every vendor feature. Deploy it as part of normal releases, not during an incident, so you trust it when you need it.
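A toy version of that kill switch; in practice the flag store would be your feature-flag service, but the call-site discipline is the point:

```python
# Hypothetical flag store; swap in your real feature-flag client.
KILL_SWITCHES = {"payments-vendor": False, "tax-vendor": False}

def vendor_enabled(vendor: str) -> bool:
    """Unknown vendors default to disabled so a missing flag fails safe."""
    return not KILL_SWITCHES.get(vendor, True)

def fetch_tax_rate(order_total: float) -> float:
    """Example call site: every vendor-backed feature checks its switch first."""
    if not vendor_enabled("tax-vendor"):
        return 0.0  # pre-agreed fallback: defer tax calculation and reconcile later
    # ...real vendor call goes here; placeholder value keeps the sketch runnable...
    return order_total * 0.07
```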
Practice vendor incidents
Run quarterly drills where a vendor rate-limits or returns corrupted data. Time how long it takes to swap DNS, change API domains, or flip feature flags. Capture which dashboards and alerts actually helped.
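A small helper for capturing those timings during a drill; the step names are whatever your runbook calls them:

```python
import time
from contextlib import contextmanager

@contextmanager
def drill_timer(step: str, results: dict):
    """Record how long each drill step takes, e.g. a DNS swap or a flag flip."""
    start = time.monotonic()
    try:
        yield
    finally:
        results[step] = time.monotonic() - start

# Usage during the drill:
timings: dict = {}
with drill_timer("flip-feature-flag", timings):
    pass  # perform the actual flag flip or API-domain change here
```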
Pre-write customer update templates for vendor outages. Include what you monitor, what your fallback is, and how to contact support if customers see lingering issues after the vendor recovers.
What to record after each drill
- Mean time to detect via synthetic vs. status-page ingestion
- Steps required to flip to backup providers or cached responses
- Which alerts to auto-suppress when upstream is confirmed degraded
