The DevOps On-Call Survival Guide: Stay Sane While Keeping Systems Up
Being on-call is one of the most stressful parts of working in tech. Your phone can ring at 3 AM, your weekend plans are never certain, and the pressure to fix production issues quickly is intense. But it doesn't have to be miserable. Here's how to build an on-call culture that works.
Building a Sustainable On-Call Rotation
Team Size and Rotation Length
| Team Size | Recommended Rotation | On-Call Frequency |
|---|---|---|
| 2-3 people | 1-week rotations | Every 2-3 weeks |
| 4-6 people | 1-week rotations | Every 4-6 weeks |
| 7-10 people | 1-week rotations | Every 7-10 weeks |
| 10+ people | Consider sub-teams | Team-specific rotation |
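With strict weekly rotations, on-call frequency follows directly from headcount. A minimal sketch of the arithmetic behind the table (the function name is illustrative):

```python
def weeks_between_shifts(team_size: int, rotation_weeks: int = 1) -> int:
    """With an equal rotation, each engineer is on-call once every
    team_size * rotation_weeks weeks."""
    if team_size < 2:
        raise ValueError("a rotation needs at least two people")
    return team_size * rotation_weeks

# e.g. a 5-person team on 1-week rotations: on-call every 5 weeks
```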
Rule of thumb: no one should be on-call more than once every 4 weeks. Teams too small to meet that (see the table above) should treat it as a signal to grow or merge rotations, because sustained higher frequency leads to burnout.
Rotation Best Practices
- Overlap handoffs — 30-minute overlap between outgoing and incoming on-call
- Handoff document — what's currently broken, what's been flaky, what was deployed
- Shadow rotation — new team members shadow for 1-2 rotations before going solo
- Compensate fairly — on-call pay, comp time, or reduced hours
- No surprises — publish the schedule at least 2 months ahead
- Easy swaps — let people trade shifts without management approval
Escalation Chain
Level 1: Primary on-call (0-15 minutes)
↓ No response in 15 minutes
Level 2: Secondary on-call (15-30 minutes)
↓ No response in 15 minutes
Level 3: Team lead / Engineering manager
↓ P1 lasting > 1 hour
Level 4: VP Engineering / CTO
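The chain above can be expressed as a simple lookup on how long a page has gone unacknowledged. A hedged sketch; level 4 is driven by incident duration rather than acknowledgement time, so it is omitted here, and the role names are placeholders:

```python
def who_to_page(minutes_unacknowledged: int) -> str:
    """Map time-without-acknowledgement to the next escalation level.
    Thresholds mirror the escalation chain above."""
    if minutes_unacknowledged < 15:
        return "primary"      # Level 1: primary on-call
    if minutes_unacknowledged < 30:
        return "secondary"    # Level 2: secondary on-call
    return "team-lead"        # Level 3: team lead / engineering manager
```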
Reducing Alert Fatigue
Alert fatigue is the #1 reason on-call becomes unbearable. When every shift generates 20+ alerts, engineers start ignoring them — including the real ones.
The Alert Audit
Review every alert that fired in the last month:
| Category | Action |
|---|---|
| Actionable + Urgent | Keep as-is |
| Actionable + Not Urgent | Move to business hours |
| Not Actionable | Delete the alert |
| Flapping (on/off/on/off) | Fix root cause or add hysteresis |
| Duplicate | Consolidate |
Target: < 2 pages per on-call shift (outside business hours).
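For the flapping row in the audit table, hysteresis means the alert fires above one threshold but only clears below a lower one, so a metric hovering near a single line cannot page on/off repeatedly. A sketch with illustrative thresholds:

```python
class HysteresisAlert:
    """Fire at or above a high threshold; clear only at or below a
    lower one, so values oscillating between the two don't flap."""

    def __init__(self, fire_at: float, clear_at: float):
        assert clear_at < fire_at, "clear threshold must be below fire threshold"
        self.fire_at, self.clear_at = fire_at, clear_at
        self.firing = False

    def update(self, value: float) -> bool:
        if not self.firing and value >= self.fire_at:
            self.firing = True
        elif self.firing and value <= self.clear_at:
            self.firing = False
        return self.firing

# CPU alert: fires at 90%, clears at 75%. A reading of 88 or 80 keeps
# the alert in its current state instead of toggling it.
cpu = HysteresisAlert(fire_at=90, clear_at=75)
states = [cpu.update(v) for v in (85, 92, 88, 80, 70)]
```

With a single 90% threshold, the same readings would have paged twice; with hysteresis they produce one firing period.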
Alert Hygiene Rules
- Every alert must have a runbook — if you don't know what to do, the alert is useless
- Every alert must be actionable — if you can't do anything about it, it's not an alert
- Tune thresholds quarterly — as your system grows, thresholds need updating
- Use severity levels — not everything is P1
- Group related alerts — database slow + app timeout + error spike = one incident, not three
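The last rule, turning database slow + app timeout + error spike into one incident, amounts to correlating alerts that fire close together in time. A minimal sketch; the 5-minute window is an assumption, and real alert managers group on labels as well as time:

```python
from datetime import datetime, timedelta

def group_alerts(alerts, window=timedelta(minutes=5)):
    """Group (timestamp, name) alerts that fire within `window` of the
    previous alert into one incident, so correlated symptoms page once."""
    incidents = []
    for ts, name in sorted(alerts):
        if incidents and ts - incidents[-1][-1][0] <= window:
            incidents[-1].append((ts, name))   # same incident
        else:
            incidents.append([(ts, name)])     # new incident
    return incidents
```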
Smart Alert Routing
3 AM on Saturday:
P1 (site down) → Phone call + SMS + Telegram
P2 (degraded) → Telegram only
P3 (minor issue) → Queue for Monday morning
2 PM on Tuesday:
P1 → Slack + Phone call
P2 → Slack channel
P3 → Slack channel (no notification)
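The routing above is just a table keyed on severity and time of day. A sketch of that table as code; the channel names and business-hours boundaries (weekdays 9:00-18:00) are assumptions:

```python
from datetime import datetime

def route(severity: str, when: datetime) -> list[str]:
    """Return notification channels for an alert, per the routing above."""
    business_hours = when.weekday() < 5 and 9 <= when.hour < 18
    if severity == "P1":
        return ["slack", "phone"] if business_hours else ["phone", "sms", "telegram"]
    if severity == "P2":
        return ["slack"] if business_hours else ["telegram"]
    # P3: never wakes anyone
    return ["slack-no-notify"] if business_hours else ["queue-for-monday"]
```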
Handling an Incident
The First 5 Minutes
- Acknowledge — let the team know you're on it
- Assess severity — is this P1 (revenue impact) or P3 (cosmetic)?
- Check recent changes — deployments, config changes, DNS updates
- Check monitoring dashboards — what metrics are abnormal?
- Decide: fix or escalate — if you can't diagnose in 15 minutes, escalate
Communication During Incidents
For your team:
- Update Slack every 15 minutes
- Be specific: "Database CPU at 98%, investigating slow queries", not "Looking into it"
- State your next action: "Going to kill the long-running query and monitor"
For customers:
- Update the status page immediately
- Use clear, non-technical language
- Provide an expected timeline
- Follow up when resolved
Common Patterns and Quick Fixes
| Pattern | Likely Cause | First Action |
|---|---|---|
| Error spike after deployment | Bad deploy | Rollback |
| Gradual slowdown over hours | Memory leak / connection pool | Restart, then investigate |
| Sudden 100% CPU | Infinite loop, regex backtracking | Kill process, check logs |
| Database connection errors | Connection pool exhausted | Restart app, check for leaks |
| Intermittent 503s | Pod crashlooping | Check pod events, increase resources |
| Everything down at once | Infrastructure issue | Check cloud provider status |
When to Wake Someone Up
Yes, wake them up:
- Revenue-impacting outage
- Data loss or corruption risk
- Security breach
- You've been working on it alone for 30+ minutes
- You need access you don't have
No, don't wake them up:
- You can fix it yourself
- It can wait until morning
- It's a known issue with a workaround
- It's a monitoring false positive
Building Good Runbooks
A runbook is a step-by-step guide for handling a specific alert. It should be written for a 3 AM brain:
Runbook Template
# Alert: Database Connection Pool Exhausted
## Severity: P2
## Symptoms
- Application returns 503 errors intermittently
- Database connection count at maximum
- Logs show "connection pool exhausted" errors
## Impact
- ~30% of API requests failing
- Users may see errors on page load
## Quick Fix
1. Restart the application: `kubectl rollout restart deployment/api`
2. Verify connections drop: check Grafana dashboard "DB Connections"
3. Monitor for 15 minutes
## Root Cause Investigation (can wait until business hours)
1. Check for connection leaks: look for unclosed transactions
2. Review recent deploys: did connection pool config change?
3. Check database slow query log: long queries hold connections
4. Consider increasing pool size (current: 20, max recommended: 50)
## Escalation
If restarting doesn't help, page the database team (see escalation chain)
Key Principles
- No decisions at 3 AM — the runbook should tell you exactly what to do
- Include the "why" — understanding helps when the standard fix doesn't work
- Link to dashboards — don't make people search for the right graph
- Include rollback steps — for deployment-related issues
- Keep it updated — outdated runbooks are dangerous
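"Keep it updated" is easier to enforce mechanically: a small linter run in CI can flag runbooks missing required sections. A sketch, assuming runbooks are markdown files with headings like the template above (adjust the section names to your own convention):

```python
REQUIRED_SECTIONS = ["Severity", "Symptoms", "Impact", "Quick Fix", "Escalation"]

def lint_runbook(text: str) -> list[str]:
    """Return the required sections missing from a runbook's markdown.

    A heading like '## Severity: P2' counts as the 'Severity' section.
    """
    present = {
        line.lstrip("# ").split(":")[0].strip()
        for line in text.splitlines()
        if line.startswith("#")
    }
    return [s for s in REQUIRED_SECTIONS if s not in present]
```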
Post-Incident Review
The Blameless Post-Mortem
Within 48 hours of a significant incident, hold a blameless post-mortem:
Structure:
1. Timeline — what happened, when, and what actions were taken
2. Impact — duration, affected users, revenue impact
3. Root cause — not "who", but "what system allowed this to happen"
4. What went well — detection, response, communication
5. What could be improved — gaps in monitoring, slow detection, unclear runbooks
6. Action items — specific, assigned, with deadlines
Blameless doesn't mean accountability-free. It means we focus on systemic improvements rather than individual blame. "The deployment pipeline should have caught this" not "John broke production."
Mental Health and On-Call
Recognizing Burnout Signs
- Dreading your on-call shift days in advance
- Anxiety about your phone ringing
- Sleep disruption even when not paged
- Resentment toward the team or company
- Decreased quality of work during business hours
Prevention
- Fair rotation — equal distribution of shifts
- Comp time — time off after heavy on-call weeks
- No-judgment swaps — life happens, let people trade shifts
- Regular retrospectives — is on-call getting better or worse?
- Invest in reliability — the best on-call shift is a quiet one
After a Bad Night
- Take the next morning off (or the whole day)
- Hand off to the secondary if you're exhausted
- Document what happened for the post-mortem
- Don't make architectural decisions while sleep-deprived
Conclusion
On-call is a shared responsibility that, when done well, makes your systems more reliable and your team more knowledgeable. The goal isn't zero incidents — it's fast detection, efficient response, and continuous improvement. Invest in monitoring, write good runbooks, reduce alert noise, and take care of your people. A sustainable on-call practice is one where engineers willingly participate because they know they'll be supported, compensated, and not burned out.