# Kubernetes Health Checks: Liveness, Readiness, and Startup Probes Explained
Kubernetes health probes are one of the most powerful — and most misunderstood — features of the platform. Configured correctly, they make your application self-healing. Configured incorrectly, they cause cascading restarts, dropped requests, and mysterious downtime.
## The Three Types of Probes

### Liveness Probe: "Is the process stuck?"
The liveness probe answers: should Kubernetes restart this container?
If the liveness probe fails, Kubernetes kills the container and starts a new one. This is useful for detecting deadlocks, infinite loops, or corrupted state that can't recover without a restart.
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
```
When the liveness probe fails:

1. The container is killed (SIGTERM → SIGKILL after the grace period)
2. A new container is started
3. If it keeps failing, Kubernetes backs off between restarts (CrashLoopBackOff)
Common mistake: Making the liveness probe check dependencies (database, external APIs). If your database is slow, the liveness probe fails, Kubernetes restarts your app, and the restarted app reconnects to the already-stressed database, making things worse. This is one of the most common causes of cascading failures in Kubernetes.
Rule: Liveness probes should only check if the process itself is healthy, not its dependencies.
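A process-only liveness check can still catch deadlocks if the worker loop publishes a heartbeat that the handler inspects. A minimal sketch of this pattern (the function names and threshold here are illustrative, not from this article):

```python
import time

# The worker loop calls record_heartbeat() on every iteration; the liveness
# handler calls is_alive(). Both look only at process-internal state, so a
# slow database or flaky external API can never fail this probe.
_last_heartbeat = time.monotonic()

def record_heartbeat() -> None:
    global _last_heartbeat
    _last_heartbeat = time.monotonic()

def is_alive(max_silence_seconds: float = 30.0) -> bool:
    # A deadlocked or infinitely looping worker stops heartbeating, so the
    # silence grows past the threshold and the probe starts failing.
    return (time.monotonic() - _last_heartbeat) < max_silence_seconds
```

The liveness endpoint then returns 200 while `is_alive()` is true and 500 otherwise, and only a genuinely stuck process triggers a restart.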
### Readiness Probe: "Can this instance handle traffic?"
The readiness probe answers: should Kubernetes send traffic to this pod?
If the readiness probe fails, the pod is removed from the Service's endpoint list. It stays running — it just stops receiving new requests.
```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
  successThreshold: 1
```
When the readiness probe fails:

1. The pod is removed from the Service's endpoints
2. No new traffic is routed to the pod
3. Existing connections continue (graceful)
4. When the probe passes again, the pod is added back
This is where you check dependencies. If your database is down, the readiness probe should fail so traffic is routed to pods that can still serve (maybe from cache).
### Startup Probe: "Has the app finished initializing?"
The startup probe answers: is the container still starting up?
Until the startup probe succeeds, liveness and readiness probes are disabled. This is crucial for applications with long startup times.
```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 5
  failureThreshold: 30  # 30 × 5s = 150s max startup time
```
Without a startup probe: If your app takes 60 seconds to start and your liveness probe has a 15-second initial delay, Kubernetes will kill the container before it finishes starting, causing a restart loop.
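Putting the three together: a sketch of how the probes from the sections above might coexist on one container. The endpoints and port follow the examples in this article; the numbers are starting points to tune for your app.

```yaml
# Sketch: all three probes on one container. The startup probe gates the
# other two, so liveness can stay aggressive without killing slow starts.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 30   # up to 150s to finish initializing
livenessProbe:
  httpGet:
    path: /healthz       # process-only check
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready         # includes dependency checks
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
```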
## Designing Your Health Endpoints

### The `/healthz` Endpoint (Liveness)
Should be simple and fast:
```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/healthz")
async def liveness():
    # Only check that the process is alive and responding
    return {"status": "alive"}
```
Do NOT include:

- Database checks
- External API checks
- Heavy computation
- File system checks (these might hang on NFS)
### The `/ready` Endpoint (Readiness)
Should verify the application can serve requests:
```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

# db and redis stand in for your application's database and Redis clients
@app.get("/ready")
async def readiness():
    checks = {}

    # Check database connectivity
    try:
        await db.execute("SELECT 1")
        checks["database"] = "ok"
    except Exception:
        checks["database"] = "failed"
        return JSONResponse({"status": "not ready", "checks": checks}, status_code=503)

    # Check cache connectivity
    try:
        redis.ping()
        checks["cache"] = "ok"
    except Exception:
        checks["cache"] = "failed"
        return JSONResponse({"status": "not ready", "checks": checks}, status_code=503)

    return {"status": "ready", "checks": checks}
```
## Configuration Guidelines

### Timing Parameters
| Parameter | Liveness | Readiness | Startup |
|---|---|---|---|
| `initialDelaySeconds` | App startup time | 0-5s | 0 |
| `periodSeconds` | 10-30s | 5-10s | 5-10s |
| `timeoutSeconds` | 3-5s | 3-5s | 3-5s |
| `failureThreshold` | 3-5 | 2-3 | 30+ |
| `successThreshold` | 1 | 1-2 | 1 |
### Formulas
Liveness failure detection time (worst case, from container start):

```
Detection = initialDelaySeconds + (periodSeconds × failureThreshold)
```

Example: 15 + (10 × 3) = 45 seconds

Maximum startup time:

```
Max startup = periodSeconds × failureThreshold
```

Example: 5 × 30 = 150 seconds
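The two formulas are simple enough to check with a few lines of arithmetic (the function names here are just for illustration):

```python
def liveness_detection_time(initial_delay: int, period: int, failure_threshold: int) -> int:
    # Worst-case seconds from container start until a restart is triggered
    return initial_delay + period * failure_threshold

def max_startup_time(period: int, failure_threshold: int) -> int:
    # Longest a startup probe will wait before the container is killed
    return period * failure_threshold

print(liveness_detection_time(15, 10, 3))  # 45 seconds
print(max_startup_time(5, 30))             # 150 seconds
```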
## Common Anti-Patterns

### 1. Liveness Probe Checks Database

```yaml
# BAD: a database outage causes every pod to restart
livenessProbe:
  httpGet:
    path: /health  # this endpoint checks the DB
```
Fix: Use /healthz (process-only) for liveness, /ready (with DB check) for readiness.
### 2. Same Endpoint for Both Probes

```yaml
# BAD: same check for liveness and readiness
livenessProbe:
  httpGet:
    path: /health
readinessProbe:
  httpGet:
    path: /health
```
Fix: Separate endpoints with different checks.
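Sketched in config, the fix for both of the anti-patterns above looks like this (paths and port as used elsewhere in this article):

```yaml
# Fix: distinct endpoints with distinct semantics
livenessProbe:
  httpGet:
    path: /healthz  # process-only check
    port: 8080
readinessProbe:
  httpGet:
    path: /ready    # includes dependency checks
    port: 8080
```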
### 3. Timeout Too Short

```yaml
# BAD: under load, a 1s timeout will fail
livenessProbe:
  httpGet:
    path: /healthz
  timeoutSeconds: 1
```
Fix: Set timeout to at least 3 seconds. Under load, even simple endpoints can be slow.
### 4. No Startup Probe for Slow Apps

```yaml
# BAD: app needs 60s to start; liveness kills it at 25s
livenessProbe:
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
```
Fix: Add a startup probe with generous failureThreshold.
### 5. Failure Threshold Too Low

```yaml
# BAD: a single failed check triggers a restart
livenessProbe:
  failureThreshold: 1
```
Fix: Use failureThreshold of 3+ to tolerate transient issues.
## Monitoring Your Probes
Set up external monitoring in addition to Kubernetes probes:
- External uptime monitoring — checks your service from outside the cluster
- Probe failure metrics — track how often probes fail (a leading indicator)
- Pod restart count — increasing restarts indicate probe misconfiguration
- CrashLoopBackOff alerts — something is fundamentally broken
## Conclusion
Kubernetes probes are your application's immune system. Liveness is the restart button for stuck processes. Readiness is the traffic light for overloaded pods. Startup is the patience for slow initializers. Get them right, and your application becomes self-healing. Get them wrong, and they become the source of your outages.