# Incident Response Checklist for Website Downtime
Your monitoring alert just fired. Your website is down. What do you do first?
Having a clear incident response plan means the difference between a 5-minute fix and a 2-hour scramble. Here's a battle-tested checklist for handling website outages.
## Phase 1: Detection & Triage (0–5 minutes)
- [ ] Acknowledge the alert — let your team know someone is on it
- [ ] Verify the outage — check from multiple locations (not just your office network)
- [ ] Check scope — is it the whole site, specific pages, or specific regions?
- [ ] Check recent changes — was anything deployed in the last hour?
- [ ] Update status page — post "Investigating" within 5 minutes
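The verification step can be scripted rather than eyeballed in a browser. A minimal sketch using `curl`; the domain is a placeholder, and the `check_site`/`classify` helpers are illustrative names, not standard tools:

```shell
#!/bin/sh
# Minimal outage check: fetch only the HTTP status code.
# example.com is a placeholder; substitute your own domain.
check_site() {
  # curl prints 000 if the connection fails entirely.
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Rough classification of the status code.
classify() {
  case "$1" in
    2??|3??) echo "up" ;;
    000)     echo "unreachable" ;;
    *)       echo "error" ;;
  esac
}

status=$(check_site "https://example.com")
echo "https://example.com -> $status ($(classify "$status"))"
```

Run it from more than one network (a laptop tether, a cloud shell) to rule out a problem that only exists on your office connection.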
## Phase 2: Diagnosis (5–15 minutes)
- [ ] Check server health — CPU, memory, disk space
- [ ] Check application logs — look for errors, stack traces, OOM kills
- [ ] Check database — connections, slow queries, replication lag
- [ ] Check DNS — is your domain resolving correctly?
- [ ] Check SSL — has a certificate expired?
- [ ] Check external dependencies — APIs, payment gateways, CDN
- [ ] Check infrastructure — cloud provider status pages
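Several of these checks are one-liners. A sketch, assuming GNU `date`, `dig`, `openssl`, and `journalctl` are available; the domain and service names are placeholders:

```shell
#!/bin/sh
# Quick diagnosis one-liners (placeholders: example.com, myapp):
#   df -h /                           # disk space
#   dig +short example.com            # does DNS resolve?
#   journalctl -u myapp --since -1h   # recent app logs (systemd hosts)

# Whole days between two dates (GNU date syntax).
days_between() {
  echo $(( ($(date -d "$2" +%s) - $(date -d "$1" +%s)) / 86400 ))
}

# Days until a TLS certificate expires, e.g.: cert_days_left example.com
cert_days_left() {
  end=$(echo | openssl s_client -servername "$1" -connect "$1:443" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2)
  days_between "$(date -u +%Y-%m-%d)" "$end"
}
```

A negative result from `cert_days_left` means the certificate has already expired, which would explain SSL errors immediately.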
## Phase 3: Resolution (15+ minutes)
- [ ] Apply the fix — rollback, restart, scale, or patch
- [ ] Verify recovery — confirm from multiple regions
- [ ] Monitor stability — watch for 15 minutes after the fix
- [ ] Update status page — post "Monitoring" then "Resolved"
- [ ] Notify stakeholders — email affected customers if needed
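The fastest fix is often rolling back to the previous release. A sketch, assuming a symlink-based deploy layout (a `releases/` directory plus a `current` symlink); the paths and layout are assumptions, not a universal convention:

```shell
#!/bin/sh
# Roll the live symlink back to the second-newest release.
# $1 = releases directory, $2 = the symlink the web server follows.
rollback() {
  releases_dir="$1"
  current_link="$2"
  prev=$(ls -1t "$releases_dir" | sed -n '2p')  # second-newest entry
  if [ -z "$prev" ]; then
    echo "no previous release to roll back to" >&2
    return 1
  fi
  ln -sfn "$releases_dir/$prev" "$current_link"
  echo "rolled back to $prev"
}
```

After the symlink flips you would typically restart or reload the app process (for example `systemctl restart myapp` on systemd hosts) so it picks up the old code.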
## Phase 4: Post-Mortem (within 48 hours)
- [ ] Document the timeline — what happened, when, and what was done
- [ ] Identify root cause — not just symptoms, but underlying cause
- [ ] List action items — what will prevent this from happening again?
- [ ] Share with the team — blameless post-mortem culture
- [ ] Update runbooks — add new knowledge to your documentation
## Common Causes and Quick Fixes
| Symptom | Likely Cause | Quick Fix |
|---|---|---|
| 502 Bad Gateway | App crashed / not running | Restart the application |
| 503 Service Unavailable | Server overloaded | Scale up or restart |
| Connection timeout | Network / firewall issue | Check security groups, routes |
| SSL error | Certificate expired | Renew certificate |
| DNS not resolving | DNS misconfiguration | Check DNS records |
| Slow response (>5s) | Database bottleneck | Check slow queries, add index |
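The table translates naturally into a first-guess triage helper you can keep in a runbook. A sketch; the hints simply mirror the rows above, and `triage_hint` is an illustrative name:

```shell
#!/bin/sh
# Map an observed HTTP status code to the table's likely cause and quick fix.
triage_hint() {
  case "$1" in
    502) echo "app crashed or not running: restart the application" ;;
    503) echo "server overloaded: scale up or restart" ;;
    000) echo "connection failed: check security groups, routes, firewall" ;;
    *)   echo "no quick hint for HTTP $1: check the logs" ;;
  esac
}

triage_hint 502
```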
## Communication Tips

### What to Say

> "We're experiencing an issue with [service]. Our team is investigating and we'll provide an update within 30 minutes."

### What NOT to Say

- "Everything is fine" (when it isn't)
- "It's the cloud provider's fault" (users don't care whose fault it is)
- Nothing at all (silence is the worst response)
## When to Escalate
- Issue persists beyond 30 minutes with no root cause identified
- Data loss is suspected
- Security breach is possible
- Multiple critical services are affected
## Prevention Checklist
The best incident response is preventing incidents in the first place:
- Monitor everything — uptime, SSL, DNS, response time
- Set up alerts on multiple channels (don't rely on just email)
- Test your monitoring — intentionally trigger an alert to verify it works
- Keep runbooks updated — document how to restart, rollback, and scale
- Practice incident response — run game days with your team
- Implement deployment safeguards — canary deploys, rollback scripts
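The "monitor everything" and "test your monitoring" items can start life as a cron-able probe. A sketch: the target URL and webhook URL are placeholders, and the JSON payload shape depends on your alerting channel (Slack, PagerDuty, etc.):

```shell
#!/bin/sh
# Cron-able probe: alert a webhook when the site stops returning 2xx/3xx.
# TARGET and WEBHOOK_URL are placeholders for your own endpoints.
TARGET="${TARGET:-https://example.com}"
WEBHOOK_URL="${WEBHOOK_URL:-}"

should_alert() {
  case "$1" in
    2??|3??) return 1 ;;  # healthy: no alert
    *)       return 0 ;;  # anything else (including 000): alert
  esac
}

probe() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$TARGET")
  if should_alert "$code"; then
    [ -n "$WEBHOOK_URL" ] &&
      curl -s -X POST -H 'Content-Type: application/json' \
        -d "{\"text\":\"$TARGET is down (HTTP $code)\"}" "$WEBHOOK_URL"
    return 1
  fi
}
```

Schedule it with an entry like `* * * * * /usr/local/bin/probe.sh`, and test the alert path itself by pointing `TARGET` at a URL you know returns an error.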
## Conclusion
Downtime happens to everyone. What sets great teams apart is how quickly and professionally they respond. Print this checklist, pin it to your team channel, and practice before you need it.