The DevOps On-Call Survival Guide: Stay Sane While Keeping Systems Up
Being on-call is one of the most stressful parts of working in tech. Your phone can ring at 3 AM, your weekend plans are never certain, and the pressure to fix production issues quickly is intense. But it doesn't have to be miserable. Here's how to build an on-call culture that works.
Building a Sustainable On-Call Rotation
Team Size and Rotation Length
| Team Size | Recommended Rotation | On-Call Frequency |
|---|---|---|
| 2-3 people | 1-week rotations | Every 2-3 weeks |
| 4-6 people | 1-week rotations | Every 4-6 weeks |
| 7-10 people | 1-week rotations | Every 7-10 weeks |
| 10+ people | Consider sub-teams | Team-specific rotation |
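With strict weekly rotations, on-call frequency follows directly from headcount. A minimal sketch of the arithmetic behind the table (the function name is illustrative):

```python
def weeks_between_shifts(team_size: int, rotation_weeks: int = 1) -> int:
    """With an equal rotation, each engineer is on-call once every
    team_size * rotation_weeks weeks."""
    if team_size < 2:
        raise ValueError("a rotation needs at least two people")
    return team_size * rotation_weeks

# e.g. a 5-person team on 1-week rotations: on-call every 5 weeks
```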
Rule of thumb: no one should be on-call more than once every 4 weeks. Teams too small to meet that (see the table above) should treat it as a signal to grow or merge rotations, because sustained higher frequency leads to burnout.
Rotation Best Practices
- Overlap handoffs — 30-minute overlap between outgoing and incoming on-call
- Handoff document — what's currently broken, what's been flaky, what was deployed
- Shadow rotation — new team members shadow for 1-2 rotations before going solo
- Compensate fairly — on-call pay, comp time, or reduced hours
- No surprises — publish the schedule at least 2 months ahead
- Easy swaps — let people trade shifts without management approval
Escalation Chain
Level 1: Primary on-call (0-15 minutes)
↓ No response in 15 minutes
Level 2: Secondary on-call (15-30 minutes)
↓ No response in 15 minutes
Level 3: Team lead / Engineering manager
↓ P1 lasting > 1 hour
Level 4: VP Engineering / CTO
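The chain above can be expressed as a simple lookup on how long a page has gone unacknowledged. A hedged sketch; level 4 is driven by incident duration rather than acknowledgement time, so it is omitted here, and the role names are placeholders:

```python
def who_to_page(minutes_unacknowledged: int) -> str:
    """Map time-without-acknowledgement to the next escalation level.
    Thresholds mirror the escalation chain above."""
    if minutes_unacknowledged < 15:
        return "primary"      # Level 1: primary on-call
    if minutes_unacknowledged < 30:
        return "secondary"    # Level 2: secondary on-call
    return "team-lead"        # Level 3: team lead / engineering manager
```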
Reducing Alert Fatigue
Alert fatigue is the #1 reason on-call becomes unbearable. When every shift generates 20+ alerts, engineers start ignoring them — including the real ones.
The Alert Audit
Review every alert that fired in the last month:
| Category | Action |
|---|---|
| Actionable + Urgent | Keep as-is |
| Actionable + Not Urgent | Move to business hours |
| Not Actionable | Delete the alert |
| Flapping (on/off/on/off) | Fix root cause or add hysteresis |
| Duplicate | Consolidate |
Target: < 2 pages per on-call shift (outside business hours).
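For the flapping row in the audit table, hysteresis means the alert fires above one threshold but only clears below a lower one, so a metric hovering near a single line cannot page on/off repeatedly. A sketch with illustrative thresholds:

```python
class HysteresisAlert:
    """Fire at or above a high threshold; clear only at or below a
    lower one, so values oscillating between the two don't flap."""

    def __init__(self, fire_at: float, clear_at: float):
        assert clear_at < fire_at, "clear threshold must be below fire threshold"
        self.fire_at, self.clear_at = fire_at, clear_at
        self.firing = False

    def update(self, value: float) -> bool:
        if not self.firing and value >= self.fire_at:
            self.firing = True
        elif self.firing and value <= self.clear_at:
            self.firing = False
        return self.firing

# CPU alert: fires at 90%, clears at 75%. A reading of 88 or 80 keeps
# the alert in its current state instead of toggling it.
cpu = HysteresisAlert(fire_at=90, clear_at=75)
states = [cpu.update(v) for v in (85, 92, 88, 80, 70)]
```

With a single 90% threshold, the same readings would have paged twice; with hysteresis they produce one firing period.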
Alert Hygiene Rules
- Every alert must have a runbook — if you don't know what to do, the alert is useless
- Every alert must be actionable — if you can't do anything about it, it's not an alert
- Tune thresholds quarterly — as your system grows, thresholds need updating
- Use severity levels — not everything is P1
- Group related alerts — database slow + app timeout + error spike = one incident, not three
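The last rule, turning database slow + app timeout + error spike into one incident, amounts to correlating alerts that fire close together in time. A minimal sketch; the 5-minute window is an assumption, and real alert managers group on labels as well as time:

```python
from datetime import datetime, timedelta

def group_alerts(alerts, window=timedelta(minutes=5)):
    """Group (timestamp, name) alerts that fire within `window` of the
    previous alert into one incident, so correlated symptoms page once."""
    incidents = []
    for ts, name in sorted(alerts):
        if incidents and ts - incidents[-1][-1][0] <= window:
            incidents[-1].append((ts, name))   # same incident
        else:
            incidents.append([(ts, name)])     # new incident
    return incidents
```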
Smart Alert Routing
3 AM on Saturday:
P1 (site down) → Phone call + SMS + Telegram
P2 (degraded) → Telegram only
P3 (minor issue) → Queue for Monday morning
2 PM on Tuesday:
P1 → Slack + Phone call
P2 → Slack channel
P3 → Slack channel (no notification)
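The routing above is just a table keyed on severity and time of day. A sketch of that table as code; the channel names and business-hours boundaries (weekdays 9:00-18:00) are assumptions:

```python
from datetime import datetime

def route(severity: str, when: datetime) -> list[str]:
    """Return notification channels for an alert, per the routing above."""
    business_hours = when.weekday() < 5 and 9 <= when.hour < 18
    if severity == "P1":
        return ["slack", "phone"] if business_hours else ["phone", "sms", "telegram"]
    if severity == "P2":
        return ["slack"] if business_hours else ["telegram"]
    # P3: never wakes anyone
    return ["slack-no-notify"] if business_hours else ["queue-for-monday"]
```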
Handling an Incident
The First 5 Minutes
- Acknowledge — let the team know you're on it
- Assess severity — is this P1 (revenue impact) or P3 (cosmetic)?
- Check recent changes — deployments, config changes, DNS updates
- Check monitoring dashboards — what metrics are abnormal?
- Decide: fix or escalate — if you can't diagnose in 15 minutes, escalate
Communication During Incidents
For your team:
- Update Slack every 15 minutes
- Be specific: "Database CPU at 98%, investigating slow queries", not "Looking into it"
- State your next action: "Going to kill the long-running query and monitor"
For customers:
- Update the status page immediately
- Use clear, non-technical language
- Provide an expected timeline
- Follow up when resolved
Common Patterns and Quick Fixes
| Pattern | Likely Cause | First Action |
|---|---|---|
| Error spike after deployment | Bad deploy | Rollback |
| Gradual slowdown over hours | Memory leak / connection pool | Restart, then investigate |
| Sudden 100% CPU | Infinite loop, regex backtracking | Kill process, check logs |
| Database connection errors | Connection pool exhausted | Restart app, check for leaks |
| Intermittent 503s | Pod crashlooping | Check pod events, increase resources |
| Everything down at once | Infrastructure issue | Check cloud provider status |
When to Wake Someone Up
Yes, wake them up:
- Revenue-impacting outage
- Data loss or corruption risk
- Security breach
- You've been working on it alone for 30+ minutes
- You need access you don't have
No, don't wake them up:
- You can fix it yourself
- It can wait until morning
- It's a known issue with a workaround
- It's a monitoring false positive
Building Good Runbooks
A runbook is a step-by-step guide for handling a specific alert. It should be written for a 3 AM brain:
Runbook Template
# Alert: Database Connection Pool Exhausted
## Severity: P2
## Symptoms
- Application returns 503 errors intermittently
- Database connection count at maximum
- Logs show "connection pool exhausted" errors
## Impact
- ~30% of API requests failing
- Users may see errors on page load
## Quick Fix
1. Restart the application: `kubectl rollout restart deployment/api`
2. Verify connections drop: check Grafana dashboard "DB Connections"
3. Monitor for 15 minutes
## Root Cause Investigation (can wait until business hours)
1. Check for connection leaks: look for unclosed transactions
2. Review recent deploys: did connection pool config change?
3. Check database slow query log: long queries hold connections
4. Consider increasing pool size (current: 20, max recommended: 50)
## Escalation
If restarting doesn't help, page the database team (see escalation chain)
Key Principles
- No decisions at 3 AM — the runbook should tell you exactly what to do
- Include the "why" — understanding helps when the standard fix doesn't work
- Link to dashboards — don't make people search for the right graph
- Include rollback steps — for deployment-related issues
- Keep it updated — outdated runbooks are dangerous
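"Keep it updated" is easier to enforce mechanically: a small linter run in CI can flag runbooks missing required sections. A sketch, assuming runbooks are markdown files with headings like the template above (adjust the section names to your own convention):

```python
REQUIRED_SECTIONS = ["Severity", "Symptoms", "Impact", "Quick Fix", "Escalation"]

def lint_runbook(text: str) -> list[str]:
    """Return the required sections missing from a runbook's markdown.

    A heading like '## Severity: P2' counts as the 'Severity' section.
    """
    present = {
        line.lstrip("# ").split(":")[0].strip()
        for line in text.splitlines()
        if line.startswith("#")
    }
    return [s for s in REQUIRED_SECTIONS if s not in present]
```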
Post-Incident Review
The Blameless Post-Mortem
Within 48 hours of a significant incident, hold a blameless post-mortem:
Structure:
1. Timeline — what happened, when, and what actions were taken
2. Impact — duration, affected users, revenue impact
3. Root cause — not "who", but "what system allowed this to happen"
4. What went well — detection, response, communication
5. What could be improved — gaps in monitoring, slow detection, unclear runbooks
6. Action items — specific, assigned, with deadlines
Blameless doesn't mean accountability-free. It means we focus on systemic improvements rather than individual blame. "The deployment pipeline should have caught this" not "John broke production."
Mental Health and On-Call
Recognizing Burnout Signs
- Dreading your on-call shift days in advance
- Anxiety about your phone ringing
- Sleep disruption even when not paged
- Resentment toward the team or company
- Decreased quality of work during business hours
Prevention
- Fair rotation — equal distribution of shifts
- Comp time — time off after heavy on-call weeks
- No-judgment swaps — life happens, let people trade shifts
- Regular retrospectives — is on-call getting better or worse?
- Invest in reliability — the best on-call shift is a quiet one
After a Bad Night
- Take the next morning off (or the whole day)
- Hand off to the secondary if you're exhausted
- Document what happened for the post-mortem
- Don't make architectural decisions while sleep-deprived
Conclusion
On-call is a shared responsibility that, when done well, makes your systems more reliable and your team more knowledgeable. The goal isn't zero incidents — it's fast detection, efficient response, and continuous improvement. Invest in monitoring, write good runbooks, reduce alert noise, and take care of your people. A sustainable on-call practice is one where engineers willingly participate because they know they'll be supported, compensated, and not burned out.