Cron Job Monitoring: How to Prevent Silent Failures
Your daily database backup cron job stopped running 3 weeks ago. Nobody noticed until the server crashed and the latest backup was from last month. Sound familiar?
Cron jobs are the silent workhorses of infrastructure — and their failures are equally silent. Unlike a web server crash that triggers immediate alerts, a cron job that stops running simply... doesn't run. No error. No alert. Just missing data that you discover at the worst possible time.
Why Cron Jobs Fail Silently
Common Failure Modes
- Server restart — crontab lost or service not started
- Disk full — job starts, fails to write output, exits silently
- Permission changes — script can't access files/databases after an update
- Dependency missing — a library or binary was removed during an update
- Timeout — job takes longer than expected and gets killed
- OOM kill — job uses too much memory and the OS kills it
- Lock file stale — previous run left a lock file, new runs skip
- Environment mismatch — works interactively, fails in cron (different PATH, env vars)
- Certificate expired — job calls an API with expired SSL
- Rate limiting — external API rejects requests
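The environment mismatch is the easiest of these to reproduce: cron starts jobs with a near-empty environment, so commands found through your interactive PATH can simply vanish. A minimal sketch of both the problem and the usual fix (the PATH value here is an example; adjust for your system):

```shell
#!/bin/sh
# Simulate cron's sparse environment: env -i strips all variables,
# so $PATH is whatever the shell falls back to (often just /usr/bin:/bin)
env -i /bin/sh -c 'echo "cron-like PATH: $PATH"'

# The usual fix: declare PATH explicitly at the top of every cron script
PATH=/usr/local/bin:/usr/bin:/bin
export PATH
echo "explicit PATH: $PATH"
```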
Why Traditional Monitoring Misses Them
Uptime monitoring checks if your website responds. Health checks verify your application is running. But neither detects that a background job silently stopped executing at 2 AM.
How Heartbeat Monitoring Works
Instead of checking if something is up, heartbeat monitoring checks if something happened. The concept is simple:
- Your cron job sends a "ping" (HTTP request) when it runs successfully
- The monitoring system expects to receive this ping on a schedule
- If the ping doesn't arrive within the expected window, an alert fires
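Conceptually, the receiving side is just a timestamp plus an age check. A self-contained shell sketch makes the interval/grace logic concrete; here the ping is simulated by touching a file, and the path and thresholds are illustrative, not part of any real service (`stat -c %Y` is GNU coreutils; macOS/BSD uses `stat -f %m`):

```shell
#!/bin/sh
# "Ping": the cron job touches a timestamp file instead of calling an HTTP endpoint
HEARTBEAT_FILE=/tmp/heartbeat-abc123
touch "$HEARTBEAT_FILE"

# "Checker": alert when the last ping is older than interval + grace
INTERVAL=3600   # job expected every hour
GRACE=600       # 10 minutes of acceptable delay
AGE=$(( $(date +%s) - $(stat -c %Y "$HEARTBEAT_FILE") ))
if [ "$AGE" -gt $((INTERVAL + GRACE)) ]; then
  echo "ALERT: heartbeat is ${AGE}s old"
else
  echo "OK: last ping ${AGE}s ago"
fi
```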
```shell
# Your cron job (before heartbeat monitoring)
0 * * * * /usr/local/bin/backup.sh

# Your cron job (with heartbeat monitoring)
0 * * * * /usr/local/bin/backup.sh && curl -s https://valpero.com/api/heartbeat/ping/abc123
```
That curl call at the end is the heartbeat. If the backup script fails (non-zero exit code), the `&&` prevents the curl from running, the expected ping never arrives, and the monitoring system alerts you.
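The short-circuit behaviour of `&&` is plain shell semantics and worth verifying once yourself (echo stands in for the curl ping):

```shell
# && runs its right-hand side only when the left side exits 0
false && echo "pinged"   # failure: nothing printed, no heartbeat sent
true  && echo "pinged"   # success: prints "pinged"
```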
What to Monitor
Critical Cron Jobs
| Job | Typical Schedule | Impact of Failure |
|---|---|---|
| Database backups | Hourly / Daily | Data loss risk |
| SSL certificate renewal | Daily check | Site outage |
| Log rotation | Daily | Disk full → crash |
| Report generation | Daily / Weekly | Business impact |
| Data sync / ETL | Hourly | Stale data |
| Cleanup tasks | Daily | Disk/DB bloat |
| Health checks | Every minute | Missed outages |
| Queue processing | Continuous | Backlog growth |
| Payment reconciliation | Daily | Financial discrepancy |
| Email digest sending | Daily | User engagement drop |
What to Include in the Heartbeat Ping
Don't just ping — include useful context:
```shell
# Basic: just ping on success
curl -s https://valpero.com/api/heartbeat/ping/abc123

# Better: include execution time and status
curl -s "https://valpero.com/api/heartbeat/ping/abc123?duration=${SECONDS}&status=ok"

# Best: send execution details
curl -s -X POST https://valpero.com/api/heartbeat/ping/abc123 \
  -H "Content-Type: application/json" \
  -d "{\"duration\": ${SECONDS}, \"records_processed\": ${COUNT}}"
```
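The `${SECONDS}` used above is a bash builtin (not POSIX sh): it counts seconds since the shell started, so resetting it to 0 just before the job turns it into a duration timer:

```shell
#!/bin/bash
SECONDS=0        # reset the bash builtin timer
sleep 2          # stand-in for the real job
echo "duration=${SECONDS}"
```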
Setting Up Heartbeat Monitoring
Step 1: Create a Heartbeat Monitor
Set up a monitor with:

- Name: "Production DB Backup" (human-readable)
- Expected interval: how often the job should run (e.g., every 1 hour)
- Grace period: how much delay is acceptable before alerting (e.g., 10 minutes)
Step 2: Add the Ping to Your Job
Add a curl/wget call at the end of your cron job, after the main task succeeds.
Step 3: Handle Failures Properly
```shell
#!/bin/bash
set -e  # Exit on any error

# Your actual job
pg_dump mydb > /backups/daily.sql
gzip /backups/daily.sql

# Only ping if everything succeeded
curl -fsS --retry 3 https://valpero.com/api/heartbeat/ping/abc123
```
Step 4: Configure Alerts
Set up alerts on the appropriate channels:

- Critical jobs (backups, payments): SMS + Telegram
- Important jobs (reports, syncs): Slack/Email
- Nice-to-have jobs (cleanup): Email only
Advanced Patterns
Wrapper Script
Create a reusable wrapper that handles logging, error capture, and heartbeat pinging:
```shell
#!/bin/bash
# heartbeat-wrapper.sh <heartbeat-id> <command...>
set -o pipefail  # make the pipeline report the command's exit code, not tee's

HEARTBEAT_ID=$1
shift

START=$(date +%s)
if "$@" 2>&1 | tee "/var/log/cron-${HEARTBEAT_ID}.log"; then
  DURATION=$(($(date +%s) - START))
  curl -fsS --retry 3 "https://valpero.com/api/heartbeat/ping/${HEARTBEAT_ID}?duration=${DURATION}"
else
  EXIT_CODE=$?
  echo "Job failed with exit code ${EXIT_CODE}" >> "/var/log/cron-${HEARTBEAT_ID}.log"
  # Don't ping — monitoring will alert on the missing heartbeat
fi
```
Usage:
```shell
0 * * * * /usr/local/bin/heartbeat-wrapper.sh abc123 /usr/local/bin/backup.sh
```
Start + End Pinging
For long-running jobs, ping at both the start and end:
```shell
# Ping "start" — we know the job attempted to run
curl -s "https://valpero.com/api/heartbeat/ping/abc123?status=start"

# Run the actual job
/usr/local/bin/heavy-etl-job.sh

# Ping "end" — we know the job completed
curl -s "https://valpero.com/api/heartbeat/ping/abc123?status=complete"
```
This lets you detect jobs that started but never finished (hung, killed, stuck).
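A complementary guard for the hung-job case is coreutils `timeout`, which kills the job once it exceeds a limit; the end ping then never fires, and the missing-heartbeat alert does its job. GNU timeout reports exit code 124 when the limit was hit:

```shell
#!/bin/sh
# timeout kills the command if it exceeds the limit (here: 1 second)
rc=0
timeout 1 sleep 5 || rc=$?
echo "exit code: $rc"   # 124 signals the job was killed at the limit
```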
Common Mistakes
- Pinging before the job runs — you'll never know if it actually succeeded
- No grace period — jobs that take variable time trigger false alerts
- Ignoring exit codes — `job.sh; curl ping` pings even on failure; use `&&`
- Not monitoring the monitor — if your heartbeat endpoint is down, all jobs appear failed
- Too many heartbeats — monitor important jobs, not every tiny script
Conclusion
Every cron job that matters should have a heartbeat monitor. It takes 60 seconds to set up and prevents the most frustrating type of failure — the one you don't discover until it's too late. If a job is important enough to schedule, it's important enough to monitor.