The DevOps On-Call Survival Guide: Stay Sane While Keeping Systems Up

Being on-call is one of the most stressful parts of working in tech. Your phone can ring at 3 AM, your weekend plans are never certain, and the pressure to fix production issues quickly is intense. But it doesn't have to be miserable. Here's how to build an on-call culture that works.

Building a Sustainable On-Call Rotation

Team Size and Rotation Length

| Team Size | Recommended Rotation | On-Call Frequency |
|-----------|----------------------|-------------------|
| 2-3 people | 1-week rotations | Every 2-3 weeks |
| 4-6 people | 1-week rotations | Every 4-6 weeks |
| 7-10 people | 1-week rotations | Every 7-10 weeks |
| 10+ people | Consider sub-teams | Team-specific rotation |

Rule of thumb: no one should be on-call more than once every 4 weeks; anything more frequent leads to burnout. Teams of 2-3 inevitably fall short of this bar — treat that as a signal to grow the rotation.

Rotation Best Practices

  1. Overlap handoffs — 30-minute overlap between outgoing and incoming on-call
  2. Handoff document — what's currently broken, what's been flaky, what was deployed
  3. Shadow rotation — new team members shadow for 1-2 rotations before going solo
  4. Compensate fairly — on-call pay, comp time, or reduced hours
  5. No surprises — publish the schedule at least 2 months ahead
  6. Easy swaps — let people trade shifts without management approval
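
The scheduling practices above can be sketched as a simple round-robin generator. This is a minimal illustration, not a replacement for a scheduling tool; the team names and start date are hypothetical, and pairing each primary with the next person as secondary is one reasonable convention.

```python
from datetime import date, timedelta

def build_rotation(engineers, start, weeks):
    """Round-robin weekly on-call schedule.

    engineers: list of names; start: the first Monday; weeks: how far
    ahead to publish (the guide suggests at least 2 months, i.e. 8+ weeks).
    Returns a list of (week_start, primary, secondary) tuples.
    """
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # Secondary is the next person up, so they get a natural handoff.
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule

# Hypothetical four-person team, published 8 weeks ahead:
for week_start, primary, secondary in build_rotation(
        ["ana", "ben", "chen", "dee"], date(2024, 1, 1), 8):
    print(week_start, primary, secondary)
```

With four engineers, each person is primary once every 4 weeks, matching the rule of thumb above.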

Escalation Chain

Level 1: Primary on-call (0-15 minutes)
    ↓ No response in 15 minutes
Level 2: Secondary on-call (15-30 minutes)
    ↓ No response in 15 minutes
Level 3: Team lead / Engineering manager
    ↓ P1 lasting > 1 hour
Level 4: VP Engineering / CTO
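
The chain above reduces to a small policy function. A hedged sketch — the thresholds come straight from the chain, but how acknowledgements and incident duration are tracked is left to your paging tool:

```python
def escalation_level(minutes_unacked, p1_duration_minutes=0):
    """Map an unacknowledged page to a level in the escalation chain.

    minutes_unacked: minutes since the first page with no acknowledgement.
    p1_duration_minutes: how long a P1 has been running; a P1 lasting
    over an hour escalates to Level 4 regardless of acknowledgements.
    """
    if p1_duration_minutes > 60:
        return 4  # VP Engineering / CTO
    if minutes_unacked < 15:
        return 1  # primary on-call
    if minutes_unacked < 30:
        return 2  # secondary on-call
    return 3      # team lead / engineering manager
```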

Reducing Alert Fatigue

Alert fatigue is the #1 reason on-call becomes unbearable. When every shift generates 20+ alerts, engineers start ignoring them — including the real ones.

The Alert Audit

Review every alert that fired in the last month:

| Category | Action |
|----------|--------|
| Actionable + Urgent | Keep as-is |
| Actionable + Not Urgent | Move to business hours |
| Not Actionable | Delete the alert |
| Flapping (on/off/on/off) | Fix root cause or add hysteresis |
| Duplicate | Consolidate |
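
Hysteresis, the fix named above for flapping alerts, means using separate trigger and clear thresholds so a metric hovering near the limit doesn't fire on and off repeatedly. A minimal sketch with illustrative thresholds:

```python
class HysteresisAlert:
    """Fire at or above `high`; clear only at or below `low` (high > low)."""

    def __init__(self, high=90.0, low=75.0):
        self.high = high
        self.low = low
        self.firing = False

    def update(self, value):
        """Feed one metric sample; returns True while the alert is firing."""
        if not self.firing and value >= self.high:
            self.firing = True
        elif self.firing and value <= self.low:
            self.firing = False
        return self.firing

# A CPU metric oscillating around 90% no longer flaps:
alert = HysteresisAlert(high=90, low=75)
states = [alert.update(v) for v in [88, 91, 89, 92, 74, 80]]
# Fires at 91, stays firing through 89 and 92, clears only at 74.
```

A single 90% threshold would have paged three separate times on the same sequence.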

Target: < 2 pages per on-call shift (outside business hours).

Alert Hygiene Rules

  1. Every alert must have a runbook — if you don't know what to do, the alert is useless
  2. Every alert must be actionable — if you can't do anything about it, it's not an alert
  3. Tune thresholds quarterly — as your system grows, thresholds need updating
  4. Use severity levels — not everything is P1
  5. Group related alerts — database slow + app timeout + error spike = one incident, not three
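
Rule 5 can be sketched as time-window correlation: alerts arriving close together are collapsed into one incident. Real alert managers group on richer keys (service, cluster, label sets); the 5-minute window here is an illustrative assumption:

```python
def group_alerts(alerts, window_seconds=300):
    """Collapse a time-sorted alert stream into incidents.

    alerts: list of (timestamp_seconds, name) tuples, sorted by time.
    An alert within `window_seconds` of the previous one joins the
    current incident (e.g. database slow + app timeout + error spike);
    otherwise it opens a new one.
    """
    incidents = []
    for ts, name in alerts:
        if incidents and ts - incidents[-1][-1][0] <= window_seconds:
            incidents[-1].append((ts, name))
        else:
            incidents.append([(ts, name)])
    return incidents
```

The cascade from rule 5 becomes one page instead of three: `group_alerts([(0, "db slow"), (60, "app timeout"), (120, "error spike")])` yields a single incident.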

Smart Alert Routing

3 AM on Saturday:
  P1 (site down) → Phone call + SMS + Telegram
  P2 (degraded) → Telegram only
  P3 (minor issue) → Queue for Monday morning

2 PM on Tuesday:
  P1 → Slack + Phone call
  P2 → Slack channel
  P3 → Slack channel (no notification)
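
The routing tables above can be expressed as one policy function. The channel names mirror the examples; the definition of business hours (Mon-Fri, 09:00-18:00 local) is an assumption you should adjust:

```python
from datetime import datetime

def route_alert(severity, now):
    """Return notification channels for an alert, per the routing tables.

    severity: "P1", "P2", or "P3"; now: a datetime in the team's timezone.
    Business hours assumed to be Mon-Fri, 09:00-18:00.
    """
    business_hours = now.weekday() < 5 and 9 <= now.hour < 18
    if business_hours:
        return {
            "P1": ["slack", "phone"],
            "P2": ["slack"],
            "P3": ["slack-no-notify"],
        }[severity]
    return {
        "P1": ["phone", "sms", "telegram"],
        "P2": ["telegram"],
        "P3": ["queue-for-morning"],
    }[severity]
```

For example, a P1 at 3 AM on a Saturday routes to phone + SMS + Telegram, while a P3 at the same hour just queues for Monday.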

Handling an Incident

The First 5 Minutes

  1. Acknowledge — let the team know you're on it
  2. Assess severity — is this P1 (revenue impact) or P3 (cosmetic)?
  3. Check recent changes — deployments, config changes, DNS updates
  4. Check monitoring dashboards — what metrics are abnormal?
  5. Decide: fix or escalate — if you can't diagnose in 15 minutes, escalate

Communication During Incidents

For your team:

  • Update Slack every 15 minutes
  • Be specific: "Database CPU at 98%, investigating slow queries", not "Looking into it"
  • State your next action: "Going to kill the long-running query and monitor"

For customers:

  • Update the status page immediately
  • Use clear, non-technical language
  • Provide an expected timeline
  • Follow up when resolved

Common Patterns and Quick Fixes

| Pattern | Likely Cause | First Action |
|---------|--------------|--------------|
| Error spike after deployment | Bad deploy | Rollback |
| Gradual slowdown over hours | Memory leak / connection pool | Restart, then investigate |
| Sudden 100% CPU | Infinite loop, regex backtracking | Kill process, check logs |
| Database connection errors | Connection pool exhausted | Restart app, check for leaks |
| Intermittent 503s | Pod crashlooping | Check pod events, increase resources |
| Everything down at once | Infrastructure issue | Check cloud provider status |

When to Wake Someone Up

Yes, wake them up:

  • Revenue-impacting outage
  • Data loss or corruption risk
  • Security breach
  • You've been working on it for 30+ minutes alone
  • You need access you don't have

No, don't wake them up:

  • You can fix it yourself
  • It can wait until morning
  • It's a known issue with a workaround
  • It's a monitoring false positive

Building Good Runbooks

A runbook is a step-by-step guide for handling a specific alert. It should be written for a 3 AM brain:

Runbook Template

```markdown
# Alert: Database Connection Pool Exhausted

## Severity: P2

## Symptoms
- Application returns 503 errors intermittently
- Database connection count at maximum
- Logs show "connection pool exhausted" errors

## Impact
- ~30% of API requests failing
- Users may see errors on page load

## Quick Fix
1. Restart the application: `kubectl rollout restart deployment/api`
2. Verify connections drop: check Grafana dashboard "DB Connections"
3. Monitor for 15 minutes

## Root Cause Investigation (can wait until business hours)
1. Check for connection leaks: look for unclosed transactions
2. Review recent deploys: did connection pool config change?
3. Check database slow query log: long queries hold connections
4. Consider increasing pool size (current: 20, max recommended: 50)

## Escalation
If restarting doesn't help, page the database team (see escalation chain)
```

Key Principles

  • No decisions at 3 AM — the runbook should tell you exactly what to do
  • Include the "why" — understanding helps when the standard fix doesn't work
  • Link to dashboards — don't make people search for the right graph
  • Include rollback steps — for deployment-related issues
  • Keep it updated — outdated runbooks are dangerous

Post-Incident Review

The Blameless Post-Mortem

Within 48 hours of a significant incident, hold a blameless post-mortem:

Structure:

  1. Timeline — what happened, when, and what actions were taken
  2. Impact — duration, affected users, revenue impact
  3. Root cause — not "who", but "what system allowed this to happen"
  4. What went well — detection, response, communication
  5. What could be improved — gaps in monitoring, slow detection, unclear runbooks
  6. Action items — specific, assigned, with deadlines

Blameless doesn't mean accountability-free. It means we focus on systemic improvements rather than individual blame. "The deployment pipeline should have caught this" not "John broke production."

Mental Health and On-Call

Recognizing Burnout Signs

  • Dreading your on-call shift days in advance
  • Anxiety about your phone ringing
  • Sleep disruption even when not paged
  • Resentment toward the team or company
  • Decreased quality of work during business hours

Prevention

  • Fair rotation — equal distribution of shifts
  • Comp time — time off after heavy on-call weeks
  • No-judgment swaps — life happens, let people trade shifts
  • Regular retrospectives — is on-call getting better or worse?
  • Invest in reliability — the best on-call shift is a quiet one

After a Bad Night

  • Take the next morning off (or the whole day)
  • Hand off to the secondary if you're exhausted
  • Document what happened for the post-mortem
  • Don't make architectural decisions while sleep-deprived

Conclusion

On-call is a shared responsibility that, when done well, makes your systems more reliable and your team more knowledgeable. The goal isn't zero incidents — it's fast detection, efficient response, and continuous improvement. Invest in monitoring, write good runbooks, reduce alert noise, and take care of your people. A sustainable on-call practice is one where engineers willingly participate because they know they'll be supported, compensated, and not burned out.