Watch limits
Monitor concurrency, throttles, and timeouts per function. Alert when invocation duration approaches the timeout or when retries climb after deployments.
Track memory and cold-start latency per endpoint, not just averages. A few heavy functions can inflate p99 latency and exhaust regional concurrency.
Watch downstream dependencies: VPC cold starts, database connection pools, and third-party APIs all affect uptime long before your function throws.
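A minimal sketch of the two checks above: flagging invocations that use most of their timeout budget, and computing per-endpoint p99 instead of a fleet-wide average. The 80% warning ratio is an assumed policy value, not a platform default.

```python
from statistics import quantiles

TIMEOUT_WARN_RATIO = 0.8  # assumed alerting threshold, not a platform default

def near_timeout(duration_ms: float, timeout_ms: float) -> bool:
    """True when an invocation used more than 80% of its timeout budget."""
    return duration_ms >= TIMEOUT_WARN_RATIO * timeout_ms

def p99(durations_ms: list[float]) -> float:
    """p99 latency for one endpoint's recent invocations."""
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    return quantiles(durations_ms, n=100)[98]

# One slow endpoint dominates the tail even when the mean looks healthy:
samples = [50.0] * 99 + [2900.0]
```

Here `p99(samples)` lands near the 2900 ms outlier while the mean stays around 78 ms, which is why per-endpoint percentiles matter.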
Warm the hot paths
Schedule warmers for login, checkout, and webhook processors. Run them from the same subnets and configuration as production so you exercise the real path.
Use provisioned capacity for peak windows and critical flows. Pair with autoscaling limits that prevent noisy tenants from starving everyone else.
Cache secrets, config, and dependencies wisely—lazy load big SDKs, but keep auth tokens refreshed before they expire mid-execution.
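A sketch of that caching pattern: keep the token in a module-level cache that survives warm invocations, and refresh it a margin before expiry so a long-running execution never holds a stale credential. `fetch_token` and `REFRESH_MARGIN_S` are illustrative names, not a real SDK API.

```python
import time

REFRESH_MARGIN_S = 60  # assumed policy: refresh this long before expiry

_token = None
_token_expiry = 0.0

def get_token(fetch_token) -> str:
    """Return a cached token, refreshing ahead of its expiry.

    fetch_token() is a stand-in for a secrets-manager or OAuth call and
    must return (token, ttl_seconds).
    """
    global _token, _token_expiry
    now = time.monotonic()
    if _token is None or now >= _token_expiry - REFRESH_MARGIN_S:
        _token, ttl_s = fetch_token()
        _token_expiry = now + ttl_s
    return _token
```

The same lazy-on-first-use shape works for heavy SDK imports: defer the import into the function that needs it so cold starts only pay for what the hot path actually touches.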
Serverless essentials
- Cold start tracking with percentile alerts
- Dependency health checks and timeout alignment
- Fallback paths to queues when downstream slows
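The queue-fallback item above can be sketched as a wrapper that tries the synchronous path and diverts work to a queue when the downstream times out. The in-memory `queue.Queue` stands in for SQS or a similar durable queue; `call_downstream` is a hypothetical dependency call.

```python
import queue

# In production this would be a durable queue (e.g. SQS), drained by a worker.
fallback_queue: "queue.Queue[dict]" = queue.Queue()

def process(payload: dict, call_downstream) -> str:
    """Try the synchronous path; divert to the queue when downstream is slow."""
    try:
        call_downstream(payload)
        return "ok"
    except TimeoutError:
        fallback_queue.put(payload)  # retry asynchronously instead of failing
        return "queued"
```

Returning "queued" lets the caller acknowledge the request immediately, trading latency for availability when the dependency degrades.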
Connect to SLOs
Tie function-level errors to customer-facing SLIs. A spike in retries should tell you which journey (signup, billing, notifications) is burning error budget.
Publish status updates when retries risk SLA breach. Be explicit about which regions or tenants are impacted and what fallback is active.
Keep runbooks for failing over to queues, switching regions, or temporarily moving heavy workflows to batch.
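One way to wire function errors to journeys is a static mapping plus a burn-rate calculation against each journey's error budget. The function names, journey mapping, and the 99.9% SLO below are assumptions for illustration.

```python
# Hypothetical function -> journey mapping; maintain alongside deploy config.
JOURNEY_OF = {
    "auth-login": "signup",
    "invoice-gen": "billing",
    "email-send": "notifications",
}
SLO_TARGET = 0.999  # assumed 99.9% success SLO

def burn_rates(counts: dict) -> dict:
    """counts: {function: (errors, total)} -> {journey: burn multiple}.

    A burn rate of 1.0 means errors arrive exactly at the budgeted pace;
    >1.0 means the journey is spending its error budget faster than the
    SLO allows, and should page before the budget is gone.
    """
    budget = 1.0 - SLO_TARGET
    agg = {}
    for fn, (errors, total) in counts.items():
        journey = JOURNEY_OF.get(fn, "other")
        e, t = agg.get(journey, (0, 0))
        agg[journey] = (e + errors, t + total)
    return {j: (e / t) / budget for j, (e, t) in agg.items() if t}
```

For example, 10 errors in 1,000 login invocations is a 1% error rate against a 0.1% budget, a burn rate of 10x on the signup journey.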
Test the ugly failure modes
Simulate throttling, permission errors, and downstream 500s in staging to verify backoff and dead-letter queues behave. Ensure alerts fire before you exhaust retries.
Track cost during incidents. Unbounded retries can spike cloud bills even after customers see 200s; set budgets and hard limits.
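A sketch of bounding that retry spend: exponential backoff with full jitter and a hard attempt cap, after which work goes to a dead-letter queue instead of burning more invocations. `MAX_ATTEMPTS` and the base delay are assumed policy values.

```python
import random

MAX_ATTEMPTS = 4   # assumed hard cap; after this, stop spending
BASE_DELAY_S = 0.2

def backoff_schedule(attempts: int = MAX_ATTEMPTS) -> list[float]:
    """Full-jitter exponential backoff delays for each retry attempt."""
    return [random.uniform(0, BASE_DELAY_S * (2 ** i)) for i in range(attempts)]

def run_with_retries(op, attempts: int = MAX_ATTEMPTS):
    """Run op(); after the final failed attempt, route to the DLQ."""
    for i in range(attempts):
        try:
            return op()
        except Exception:
            if i == attempts - 1:
                return "dead-letter"  # hard stop: hand off, don't retry again
    return None
```

The jitter spreads retry traffic so a downstream recovering from an outage isn't hit by synchronized waves, and the cap makes the worst-case cost of an incident calculable in advance.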
