The Complete Guide to Monitoring Microservices in Production
Microservices solved the monolith scaling problem — and created a monitoring nightmare. Instead of one application to watch, you now have dozens of services, each with its own database, queue, and failure modes. Here's how to get observability right.
Why Microservice Monitoring Is Different
In a monolith, a failing component usually takes down the whole application. It's obvious something is wrong. In a microservices architecture, failures are subtle:
- Cascading failures — Service A slows down, causing Service B to time out, causing Service C's queue to back up, causing the user to see a spinner
- Partial degradation — The checkout works but recommendations are broken. Is that an incident?
- Network complexity — Services communicate over the network, adding latency, retry storms, and connection pool exhaustion
- Distributed state — Data is spread across multiple databases, making consistency issues hard to diagnose
The Four Pillars of Microservice Observability
1. Health Checks
Every service should expose a health endpoint that your monitoring system can check:
GET /health

{
  "status": "healthy",
  "version": "2.4.1",
  "uptime": "48h32m",
  "dependencies": {
    "database": "ok",
    "redis": "ok",
    "payment-service": "ok"
  }
}
Best practices:
- Check downstream dependencies, not just the service itself
- Include version info for deployment tracking
- Use three states: healthy, degraded, unhealthy
- Set appropriate timeouts (health check shouldn't take > 5 seconds)
- Don't cache health check responses
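A minimal sketch of the handler logic behind such an endpoint, assuming hypothetical probe functions for each dependency (the names and the critical/degradable split are illustrative):

```python
import time

# Hypothetical dependency probes; each should return True/False within its own timeout.
def check_database() -> bool: return True
def check_redis() -> bool: return True

START_TIME = time.monotonic()
VERSION = "2.4.1"

def health() -> dict:
    """Build a health response using the three states: healthy, degraded, unhealthy."""
    deps = {
        "database": "ok" if check_database() else "failed",
        "redis": "ok" if check_redis() else "failed",
    }
    failed = [name for name, state in deps.items() if state != "ok"]
    if not failed:
        status = "healthy"
    elif "database" in failed:      # a critical dependency is down
        status = "unhealthy"
    else:                           # only degradable dependencies are down
        status = "degraded"
    return {
        "status": status,
        "version": VERSION,
        "uptime_seconds": round(time.monotonic() - START_TIME),
        "dependencies": deps,
    }
```

The dict would be serialized to JSON by whatever web framework serves the endpoint; the framework and route wiring are omitted here.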
2. Metrics
Collect quantitative data about your service behavior:
The RED Method (for request-driven services):

- Rate — requests per second
- Errors — failed requests per second
- Duration — response time distribution (p50, p95, p99)
The USE Method (for infrastructure):

- Utilization — how busy is the resource (CPU, memory, disk)
- Saturation — how much work is queued
- Errors — error count
Key metrics to track per service:

| Metric | Why It Matters |
|--------|----------------|
| Request rate | Traffic patterns, capacity planning |
| Error rate | Service health, deployment impact |
| p99 latency | Worst-case user experience |
| Connection pool usage | Resource exhaustion risk |
| Queue depth | Backpressure, processing lag |
| Circuit breaker state | Downstream dependency health |
| Memory / CPU usage | Resource constraints |
| GC pause time | JVM/Go runtime health |
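The RED numbers above can be derived from raw request records; a stdlib-only sketch using a nearest-rank percentile (a real service would export these via a metrics library such as prometheus_client rather than computing them by hand):

```python
def red_metrics(requests: list[dict], window_seconds: float) -> dict:
    """Compute Rate, Errors, Duration from request records.

    Each record is assumed to look like {"ok": bool, "duration_ms": float}.
    """
    durations = sorted(r["duration_ms"] for r in requests)
    errors = sum(1 for r in requests if not r["ok"])

    def pct(p: float) -> float:
        # Nearest-rank percentile over the sorted durations.
        idx = min(len(durations) - 1, int(p / 100 * len(durations)))
        return durations[idx]

    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rps": errors / window_seconds,
        "p50_ms": pct(50),
        "p95_ms": pct(95),
        "p99_ms": pct(99),
    }
```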
3. Distributed Tracing
When a user request touches 5 services, you need to follow that request across the entire chain:
User Request → API Gateway → Auth Service → Product Service → Inventory DB → Response
                                 ↓
                           Audit Logger
A trace ID propagated through all services lets you:

- See the complete request flow
- Identify which service added latency
- Find the root cause of errors in complex call chains
- Understand dependency relationships
Implementation:
- Use OpenTelemetry (vendor-neutral standard)
- Propagate trace context via HTTP headers (traceparent)
- Sample traces at 1–10% in production (100% is too expensive)
- Store traces for at least 7 days
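The traceparent header follows the W3C Trace Context format: `version-traceid-spanid-flags`. A sketch of minting and propagating it by hand, to make the mechanics concrete — in practice the OpenTelemetry SDK handles this for you:

```python
import secrets

def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: 00-<32 hex trace id>-<16 hex span id>-<flags>."""
    trace_id = secrets.token_hex(16)   # 32 hex chars, shared by the whole request
    span_id = secrets.token_hex(8)     # 16 hex chars, unique per hop
    flags = "01" if sampled else "00"  # 01 = sampled
    return f"00-{trace_id}-{span_id}-{flags}"

def child_traceparent(parent: str) -> str:
    """Propagate to a downstream call: keep the trace ID, mint a new span ID."""
    version, trace_id, _parent_span, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Every outgoing HTTP request would carry the child header, so all five services in the chain share one trace ID while each hop gets its own span ID.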
4. Structured Logging
Logs are your last resort when metrics and traces don't explain the problem:
{
  "timestamp": "2026-04-07T14:32:01Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123",
  "message": "Payment declined",
  "user_id": "u_456",
  "amount": 99.99,
  "error_code": "INSUFFICIENT_FUNDS",
  "duration_ms": 234
}
Best practices:

- Use structured (JSON) logging, not free-text
- Include trace ID in every log entry
- Log at the right level (don't log every successful request at INFO)
- Centralize logs (ELK stack, Loki, CloudWatch)
- Set retention policies (7 days hot, 30 days warm, 90 days cold)
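A sketch of a JSON log formatter along these lines, using Python's standard logging module (the service name and the extra-field list are illustrative):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object, carrying the trace ID when present."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "payment-service",   # illustrative service name
            "message": record.getMessage(),
        }
        # Fields passed via logger.error(..., extra={...}) land as record attributes.
        for key in ("trace_id", "error_code", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment declined",
             extra={"trace_id": "abc123",
                    "error_code": "INSUFFICIENT_FUNDS",
                    "duration_ms": 234})
```

One JSON object per line is what log shippers for Loki or the ELK stack expect to ingest.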
Alert Strategy for Microservices
The Problem with Naive Alerting
If you alert on every service independently, a single downstream failure triggers alerts from every upstream service. Your on-call engineer gets 47 alerts instead of 1.
Better Approach: Symptom-Based Alerting
Alert on user-facing symptoms, not individual service metrics:
| Alert On | Instead Of |
|---|---|
| Checkout error rate > 1% | Payment service p99 > 2s |
| Search results empty for > 5 min | Elasticsearch cluster yellow |
| Login success rate < 95% | Auth service CPU > 80% |
| API p99 > 3s for 5 minutes | Any individual service slow |
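The "symptom sustained for N minutes" pattern in the left-hand column can be sketched as a small evaluator that fires only when the breach persists (the threshold and window here are illustrative):

```python
class SymptomAlert:
    """Fire when a user-facing error rate stays above a threshold for a sustained window."""
    def __init__(self, threshold: float, sustain_seconds: float):
        self.threshold = threshold
        self.sustain = sustain_seconds
        self.breach_since: float | None = None  # when the current breach started

    def observe(self, error_rate: float, now: float) -> bool:
        """Record one measurement; return True if the alert should fire."""
        if error_rate > self.threshold:
            if self.breach_since is None:
                self.breach_since = now
            return now - self.breach_since >= self.sustain
        self.breach_since = None  # recovered: reset the breach timer
        return False
```

Requiring the breach to persist filters out the single-scrape blips that make naive threshold alerts noisy.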
Alert Levels
| Level | Criteria | Action |
|---|---|---|
| P1 Critical | Revenue-impacting, data loss risk | Page on-call immediately |
| P2 High | User-facing degradation | Notify team, fix within 1 hour |
| P3 Medium | Internal tooling, non-critical | Fix during business hours |
| P4 Low | Cosmetic, tech debt | Add to backlog |
Reducing Alert Fatigue
- Group related alerts — one incident, one notification
- Use alert cooldowns — don't re-alert for the same issue within 15 minutes
- Require confirmation from multiple regions — eliminate network-related false positives
- Review alerts monthly — delete alerts nobody acts on
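The cooldown rule above can be sketched as a simple deduplicator keyed by incident (the 15-minute window matches the bullet; the incident keys are illustrative):

```python
class AlertDeduplicator:
    """Suppress re-alerts for the same incident key within a cooldown window."""
    def __init__(self, cooldown_seconds: float = 900):  # 15 minutes
        self.cooldown = cooldown_seconds
        self.last_sent: dict[str, float] = {}

    def should_notify(self, incident_key: str, now: float) -> bool:
        last = self.last_sent.get(incident_key)
        if last is not None and now - last < self.cooldown:
            return False  # same incident, still inside the cooldown
        self.last_sent[incident_key] = now
        return True
```

Grouping upstream alerts under one incident key is what turns 47 notifications into 1.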
Service Dependency Mapping
You can't monitor what you don't understand. Map your service dependencies:
┌─────────┐ ┌──────────┐ ┌───────────┐
│ API │────→│ Auth │────→│ User DB │
│ Gateway │ │ Service │ └───────────┘
└────┬────┘ └──────────┘
│
├──────→ Product Service ──→ Product DB
│ │
│ └──→ Search (Elasticsearch)
│
└──────→ Order Service ──→ Order DB
│
├──→ Payment Service ──→ Stripe API
└──→ Notification Service ──→ Email/SMS
For each dependency, document:

- Is it critical (service can't function without it) or degradable (can fall back)?
- What's the timeout and retry policy?
- Is there a circuit breaker?
- What's the SLA?
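A minimal circuit-breaker sketch for such a dependency, assuming a consecutive-failure threshold and a fixed cooldown (both parameters are illustrative; production libraries add half-open probing, per-endpoint state, and metrics):

```python
class CircuitBreaker:
    """Open after N consecutive failures; allow a retry after a cooldown elapses."""
    def __init__(self, failure_threshold: int = 5, reset_seconds: float = 30):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at: float | None = None  # None means the circuit is closed

    def allow(self, now: float) -> bool:
        """May a request proceed? Half-open once the cooldown has elapsed."""
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.reset_seconds

    def record(self, success: bool, now: float) -> None:
        """Feed the outcome of each call back into the breaker."""
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now
```

While the circuit is open, the caller falls back immediately instead of waiting on timeouts, which is what stops one slow dependency from cascading upstream.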
Deployment Monitoring
Every deployment is a potential incident. Monitor deployments with:
- Canary deployments — route 5% of traffic to the new version, compare error rates
- Deployment markers — annotate your metrics dashboard with deployment timestamps
- Automatic rollback — if error rate increases > 2x within 10 minutes, roll back
- Health check gates — new instances must pass health checks before receiving traffic
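The automatic-rollback rule above can be sketched as a pure decision function comparing canary and baseline error rates (the 1% floor used when the baseline is clean is an assumption, not from the text):

```python
def should_rollback(baseline_error_rate: float,
                    canary_error_rate: float,
                    elapsed_minutes: float) -> bool:
    """Roll back if the canary's error rate more than doubles the baseline within 10 minutes."""
    if elapsed_minutes > 10:
        return False  # past the observation window; leave the decision to humans
    if baseline_error_rate == 0:
        return canary_error_rate > 0.01  # assumed floor when the baseline is clean
    return canary_error_rate > 2 * baseline_error_rate
```

A deployment pipeline would evaluate this each scrape interval while the canary holds its 5% traffic share.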
Recommended Monitoring Stack
| Layer | Tool | Purpose |
|---|---|---|
| Uptime | Valpero | External availability + SSL + multi-region |
| Metrics | Prometheus + Grafana | Internal service metrics |
| Tracing | Jaeger / Tempo | Distributed request tracing |
| Logging | Loki / ELK | Centralized structured logs |
| Alerting | Valpero + Alertmanager | Symptom-based alerts |
| Status Page | Valpero | Customer-facing incident communication |
Common Pitfalls
- Monitoring too much — dashboards with 50 panels that nobody looks at
- Monitoring too little — "it was fine in staging" syndrome
- No runbooks — alerts fire but nobody knows what to do
- Ignoring dependencies — your service is healthy but a dependency it calls is down
- No SLOs — without a target, you can't measure if monitoring is working
Conclusion
Microservice monitoring isn't about watching individual services — it's about understanding the system as a whole. Start with health checks and external monitoring, add metrics and alerting as you grow, and invest in tracing when debugging becomes painful. The goal is always the same: know about problems before your users do.