# Incident Response Checklist for Website Downtime
Your monitoring alert just fired. Your website is down. What do you do first?
Having a clear incident response plan means the difference between a 5-minute fix and a 2-hour scramble. Here's a battle-tested checklist for handling website outages.
## Phase 1: Detection & Triage (0–5 minutes)
- [ ] Acknowledge the alert — let your team know someone is on it
- [ ] Verify the outage — check from multiple locations (not just your office network)
- [ ] Check scope — is it the whole site, specific pages, or specific regions?
- [ ] Check recent changes — was anything deployed in the last hour?
- [ ] Update status page — post "Investigating" within 5 minutes
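The verification step can be scripted rather than eyeballed in a browser. A minimal sketch using `curl`; the domain is a placeholder, and the `check_site`/`classify` helpers are illustrative names, not standard tools:

```shell
#!/bin/sh
# Minimal outage check: fetch only the HTTP status code.
# example.com is a placeholder; substitute your own domain.
check_site() {
  # curl prints 000 if the connection fails entirely.
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Rough classification of the status code.
classify() {
  case "$1" in
    2??|3??) echo "up" ;;
    000)     echo "unreachable" ;;
    *)       echo "error" ;;
  esac
}

status=$(check_site "https://example.com")
echo "https://example.com -> $status ($(classify "$status"))"
```

Run it from more than one network (a laptop tether, a cloud shell) to rule out a problem that only exists on your office connection.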
## Phase 2: Diagnosis (5–15 minutes)
- [ ] Check server health — CPU, memory, disk space
- [ ] Check application logs — look for errors, stack traces, OOM kills
- [ ] Check database — connections, slow queries, replication lag
- [ ] Check DNS — is your domain resolving correctly?
- [ ] Check SSL — has a certificate expired?
- [ ] Check external dependencies — APIs, payment gateways, CDN
- [ ] Check infrastructure — cloud provider status pages
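Several of these checks are one-liners. A sketch, assuming GNU `date`, `dig`, `openssl`, and `journalctl` are available; the domain and service names are placeholders:

```shell
#!/bin/sh
# Quick diagnosis one-liners (placeholders: example.com, myapp):
#   df -h /                           # disk space
#   dig +short example.com            # does DNS resolve?
#   journalctl -u myapp --since -1h   # recent app logs (systemd hosts)

# Whole days between two dates (GNU date syntax).
days_between() {
  echo $(( ($(date -d "$2" +%s) - $(date -d "$1" +%s)) / 86400 ))
}

# Days until a TLS certificate expires, e.g.: cert_days_left example.com
cert_days_left() {
  end=$(echo | openssl s_client -servername "$1" -connect "$1:443" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2)
  days_between "$(date -u +%Y-%m-%d)" "$end"
}
```

A negative result from `cert_days_left` means the certificate has already expired, which would explain SSL errors immediately.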
## Phase 3: Resolution (15+ minutes)
- [ ] Apply the fix — rollback, restart, scale, or patch
- [ ] Verify recovery — confirm from multiple regions
- [ ] Monitor stability — watch for 15 minutes after the fix
- [ ] Update status page — post "Monitoring" then "Resolved"
- [ ] Notify stakeholders — email affected customers if needed
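The fastest fix is often rolling back to the previous release. A sketch, assuming a symlink-based deploy layout (a `releases/` directory plus a `current` symlink); the paths and layout are assumptions, not a universal convention:

```shell
#!/bin/sh
# Roll the live symlink back to the second-newest release.
# $1 = releases directory, $2 = the symlink the web server follows.
rollback() {
  releases_dir="$1"
  current_link="$2"
  prev=$(ls -1t "$releases_dir" | sed -n '2p')  # second-newest entry
  if [ -z "$prev" ]; then
    echo "no previous release to roll back to" >&2
    return 1
  fi
  ln -sfn "$releases_dir/$prev" "$current_link"
  echo "rolled back to $prev"
}
```

After the symlink flips you would typically restart or reload the app process (for example `systemctl restart myapp` on systemd hosts) so it picks up the old code.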
## Phase 4: Post-Mortem (within 48 hours)
- [ ] Document the timeline — what happened, when, and what was done
- [ ] Identify root cause — not just symptoms, but underlying cause
- [ ] List action items — what will prevent this from happening again?
- [ ] Share with the team — blameless post-mortem culture
- [ ] Update runbooks — add new knowledge to your documentation
## Common Causes and Quick Fixes
| Symptom | Likely Cause | Quick Fix |
|---|---|---|
| 502 Bad Gateway | App crashed / not running | Restart the application |
| 503 Service Unavailable | Server overloaded | Scale up or restart |
| Connection timeout | Network / firewall issue | Check security groups, routes |
| SSL error | Certificate expired | Renew certificate |
| DNS not resolving | DNS misconfiguration | Check DNS records |
| Slow response (>5s) | Database bottleneck | Check slow queries, add index |
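The table translates naturally into a first-guess triage helper you can keep in a runbook. A sketch; the hints simply mirror the rows above, and `triage_hint` is an illustrative name:

```shell
#!/bin/sh
# Map an observed HTTP status code to the table's likely cause and quick fix.
triage_hint() {
  case "$1" in
    502) echo "app crashed or not running: restart the application" ;;
    503) echo "server overloaded: scale up or restart" ;;
    000) echo "connection failed: check security groups, routes, firewall" ;;
    *)   echo "no quick hint for HTTP $1: check the logs" ;;
  esac
}

triage_hint 502
```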
## Communication Tips

### What to Say

> "We're experiencing an issue with [service]. Our team is investigating and we'll provide an update within 30 minutes."

### What NOT to Say

- "Everything is fine" (when it isn't)
- "It's the cloud provider's fault" (users don't care whose fault it is)
- Nothing at all (silence is the worst response)
## When to Escalate
- Issue persists beyond 30 minutes with no root cause identified
- Data loss is suspected
- Security breach is possible
- Multiple critical services are affected
## Prevention Checklist
The best incident response is preventing incidents in the first place:
- Monitor everything — uptime, SSL, DNS, response time
- Set up alerts on multiple channels (don't rely on just email)
- Test your monitoring — intentionally trigger an alert to verify it works
- Keep runbooks updated — document how to restart, rollback, and scale
- Practice incident response — run game days with your team
- Implement deployment safeguards — canary deploys, rollback scripts
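The "monitor everything" and "test your monitoring" items can start life as a cron-able probe. A sketch: the target URL and webhook URL are placeholders, and the JSON payload shape depends on your alerting channel (Slack, PagerDuty, etc.):

```shell
#!/bin/sh
# Cron-able probe: alert a webhook when the site stops returning 2xx/3xx.
# TARGET and WEBHOOK_URL are placeholders for your own endpoints.
TARGET="${TARGET:-https://example.com}"
WEBHOOK_URL="${WEBHOOK_URL:-}"

should_alert() {
  case "$1" in
    2??|3??) return 1 ;;  # healthy: no alert
    *)       return 0 ;;  # anything else (including 000): alert
  esac
}

probe() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$TARGET")
  if should_alert "$code"; then
    [ -n "$WEBHOOK_URL" ] &&
      curl -s -X POST -H 'Content-Type: application/json' \
        -d "{\"text\":\"$TARGET is down (HTTP $code)\"}" "$WEBHOOK_URL"
    return 1
  fi
}
```

Schedule it with an entry like `* * * * * /usr/local/bin/probe.sh`, and test the alert path itself by pointing `TARGET` at a URL you know returns an error.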
## Conclusion
Downtime happens to everyone. What sets great teams apart is how quickly and professionally they respond. Print this checklist, pin it to your team channel, and practice before you need it.