The Complete Guide to Monitoring Microservices in Production

Microservices solved the monolith scaling problem — and created a monitoring nightmare. Instead of one application to watch, you now have dozens of services, each with its own database, queue, and failure modes. Here's how to get observability right.

Why Microservice Monitoring Is Different

In a monolith, a failing component usually takes down the whole application. It's obvious something is wrong. In a microservices architecture, failures are subtle:

  • Cascading failures — Service A slows down, causing Service B to time out, causing Service C to queue up, causing the user to see a spinner
  • Partial degradation — The checkout works but recommendations are broken. Is that an incident?
  • Network complexity — Services communicate over the network, adding latency, retry storms, and connection pool exhaustion
  • Distributed state — Data is spread across multiple databases, making consistency issues hard to diagnose

The Four Pillars of Microservice Observability

1. Health Checks

Every service should expose a health endpoint that your monitoring system can check:

GET /health
{
  "status": "healthy",
  "version": "2.4.1",
  "uptime": "48h32m",
  "dependencies": {
    "database": "ok",
    "redis": "ok",
    "payment-service": "ok"
  }
}

Best practices:

  • Check downstream dependencies, not just the service itself
  • Include version info for deployment tracking
  • Use three states: healthy, degraded, unhealthy
  • Set appropriate timeouts (a health check shouldn't take > 5 seconds)
  • Don't cache health check responses
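As an illustration, the three-state aggregation can be sketched as a plain function; the dependency names and the "critical" flags here are assumptions for the example, not part of any specific framework:

```python
# Sketch: aggregate per-dependency checks into healthy / degraded / unhealthy.
# `checks` maps a dependency name -> (passed, critical); the names and
# criticality flags are illustrative assumptions.

def aggregate_health(checks: dict) -> dict:
    """healthy: everything passed; unhealthy: a critical dependency failed;
    degraded: only non-critical dependencies failed."""
    deps = {name: "ok" if passed else "failing"
            for name, (passed, _critical) in checks.items()}
    if all(passed for passed, _ in checks.values()):
        status = "healthy"
    elif any(not passed and critical for passed, critical in checks.values()):
        status = "unhealthy"
    else:
        status = "degraded"
    return {"status": status, "dependencies": deps}
```

A health endpoint handler would run the real checks (with timeouts) and serialize this result alongside version and uptime info.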

2. Metrics

Collect quantitative data about your service behavior:

The RED Method (for request-driven services):

  • Rate — requests per second
  • Errors — failed requests per second
  • Duration — response time distribution (p50, p95, p99)

The USE Method (for infrastructure):

  • Utilization — how busy the resource is (CPU, memory, disk)
  • Saturation — how much work is queued
  • Errors — error count

Key metrics to track per service:

| Metric | Why It Matters |
|--------|----------------|
| Request rate | Traffic patterns, capacity planning |
| Error rate | Service health, deployment impact |
| p99 latency | Worst-case user experience |
| Connection pool usage | Resource exhaustion risk |
| Queue depth | Backpressure, processing lag |
| Circuit breaker state | Downstream dependency health |
| Memory / CPU usage | Resource constraints |
| GC pause time | JVM/Go runtime health |
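To make RED concrete, here is a minimal sketch that derives rate, error rate, and latency percentiles from a window of request records. The record shape (status code, duration in ms) and the nearest-rank percentile method are assumptions for illustration; real systems compute this from histogram buckets:

```python
# Sketch: compute RED metrics from a window of (status_code, duration_ms)
# request records. Record shape and window length are illustrative.

def red_metrics(requests, window_seconds):
    durations = sorted(d for _, d in requests)
    errors = sum(1 for status, _ in requests if status >= 500)

    def percentile(p):
        # nearest-rank percentile over the sorted durations
        idx = max(0, int(round(p / 100 * len(durations))) - 1)
        return durations[idx]

    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rps": errors / window_seconds,
        "p50_ms": percentile(50),
        "p95_ms": percentile(95),
        "p99_ms": percentile(99),
    }
```

Note how p99 is reported alongside p50: averages hide the tail, and the tail is what your slowest users experience.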

3. Distributed Tracing

When a user request touches 5 services, you need to follow that request across the entire chain:

User Request → API Gateway → Auth Service → Product Service → Inventory DB → Response
                                    ↓
                              Audit Logger

A trace ID propagated through all services lets you:

  • See the complete request flow
  • Identify which service added latency
  • Find the root cause of errors in complex call chains
  • Understand dependency relationships

Implementation:

  • Use OpenTelemetry (vendor-neutral standard)
  • Propagate trace context via HTTP headers (traceparent)
  • Sample traces at 1–10% in production (100% is too expensive)
  • Store traces for at least 7 days
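A minimal sketch of how the traceparent header works, following the W3C Trace Context layout (version-traceid-spanid-flags). In practice the OpenTelemetry SDK does this for you; the helper names here are mine:

```python
import secrets

# Sketch: generate and propagate a W3C traceparent header.
# Layout: version(2 hex)-trace_id(32 hex)-parent_span_id(16 hex)-flags(2 hex)

def new_traceparent(sampled: bool = True) -> str:
    trace_id = secrets.token_hex(16)   # 32 hex chars, shared by the whole request
    span_id = secrets.token_hex(8)     # 16 hex chars, unique per hop
    flags = "01" if sampled else "00"  # sampling decision travels with the trace
    return f"00-{trace_id}-{span_id}-{flags}"

def next_hop(traceparent: str) -> str:
    """Keep the trace ID, mint a new span ID for the outgoing call."""
    version, trace_id, _parent, flags = traceparent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

The key property: the trace ID never changes across hops, so every service's spans and logs can be joined on it, while each hop gets its own span ID.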

4. Structured Logging

Logs are your last resort when metrics and traces don't explain the problem:

{
  "timestamp": "2026-04-07T14:32:01Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123",
  "message": "Payment declined",
  "user_id": "u_456",
  "amount": 99.99,
  "error_code": "INSUFFICIENT_FUNDS",
  "duration_ms": 234
}

Best practices:

  • Use structured (JSON) logging, not free-text
  • Include trace ID in every log entry
  • Log at the right level (don't log every successful request at INFO)
  • Centralize logs (ELK stack, Loki, CloudWatch)
  • Set retention policies (7 days hot, 30 days warm, 90 days cold)
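An entry like the one above can be produced by a tiny wrapper, sketched here without any logging library; the field names follow the example, and the wrapper itself is illustrative:

```python
import json
import sys
from datetime import datetime, timezone

# Sketch: a minimal structured (JSON) logger that stamps every entry with
# service name and trace ID. Field names follow the example above.

def make_logger(service: str, trace_id: str, stream=sys.stdout):
    def log(level: str, message: str, **fields):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "level": level,
            "service": service,
            "trace_id": trace_id,
            "message": message,
            **fields,          # arbitrary context: user_id, amount, error_code...
        }
        stream.write(json.dumps(entry) + "\n")
    return log
```

Because the trace ID is baked in at logger creation, every entry for a request is joinable against its trace without each call site remembering to pass it.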

Alert Strategy for Microservices

The Problem with Naive Alerting

If you alert on every service independently, a single downstream failure triggers alerts from every upstream service. Your on-call engineer gets 47 alerts instead of 1.

Better Approach: Symptom-Based Alerting

Alert on user-facing symptoms, not individual service metrics:

Alert On                            Instead Of
Checkout error rate > 1%            Payment service p99 > 2s
Search results empty for > 5 min    Elasticsearch cluster yellow
Login success rate < 95%            Auth service CPU > 80%
API p99 > 3s for 5 minutes          Any individual service slow
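A symptom-based rule like "checkout error rate > 1%" evaluates one user-facing ratio, regardless of which downstream service is actually at fault. A sketch (the threshold mirrors the table; the record shape is an assumption):

```python
# Sketch: evaluate a symptom-based rule over a window of checkout outcomes.
# The 1% threshold mirrors the table above; the record shape is illustrative.

def checkout_alert(outcomes, threshold: float = 0.01) -> bool:
    """outcomes: list of bools, True = checkout succeeded.
    Fires when the failure ratio exceeds the threshold."""
    if not outcomes:
        return False          # no traffic is a different signal, not this one
    failures = outcomes.count(False)
    return failures / len(outcomes) > threshold
```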

Alert Levels

Level        Criteria                            Action
P1 Critical  Revenue-impacting, data loss risk   Page on-call immediately
P2 High      User-facing degradation             Notify team, fix within 1 hour
P3 Medium    Internal tooling, non-critical      Fix during business hours
P4 Low       Cosmetic, tech debt                 Add to backlog

Reducing Alert Fatigue

  • Group related alerts — one incident, one notification
  • Use alert cooldowns — don't re-alert for the same issue within 15 minutes
  • Require confirmation from multiple regions — eliminate network-related false positives
  • Review alerts monthly — delete alerts nobody acts on
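Cooldowns and grouping can be sketched as a small dedup gate: one key per incident, one notification per cooldown window. The 15-minute default follows the list above; the keying scheme is an assumption:

```python
import time

# Sketch: suppress repeat notifications for the same incident key within a
# cooldown window. The 15-minute default follows the text above.

class AlertGate:
    def __init__(self, cooldown_seconds: float = 15 * 60):
        self.cooldown = cooldown_seconds
        self.last_fired = {}              # incident_key -> last notify time

    def should_notify(self, incident_key: str, now=None) -> bool:
        now = time.time() if now is None else now
        last = self.last_fired.get(incident_key)
        if last is not None and now - last < self.cooldown:
            return False                  # same incident, still cooling down
        self.last_fired[incident_key] = now
        return True
```

Grouping happens in the choice of `incident_key`: key on the symptom ("checkout-errors"), not on the individual service, and 47 upstream alerts collapse into one notification.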

Service Dependency Mapping

You can't monitor what you don't understand. Map your service dependencies:

┌─────────┐     ┌──────────┐     ┌───────────┐
│   API   │────→│  Auth    │────→│  User DB  │
│ Gateway │     │ Service  │     └───────────┘
└────┬────┘     └──────────┘
     │
     ├──────→ Product Service ──→ Product DB
     │              │
     │              └──→ Search (Elasticsearch)
     │
     └──────→ Order Service ──→ Order DB
                    │
                    ├──→ Payment Service ──→ Stripe API
                    └──→ Notification Service ──→ Email/SMS

For each dependency, document:

  • Is it critical (service can't function without it) or degradable (can fall back)?
  • What's the timeout and retry policy?
  • Is there a circuit breaker?
  • What's the SLA?
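Since the checklist asks "is there a circuit breaker?", here is a minimal sketch of one. The thresholds and state handling are illustrative; production services typically get this from a library or a service mesh rather than hand-rolling it:

```python
import time

# Sketch: a minimal circuit breaker. After `max_failures` consecutive
# failures the circuit opens and calls fail fast; after `reset_seconds`
# it lets one trial call through (half-open).

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_seconds=30.0):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, now=None):
        now = time.time() if now is None else now
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None         # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now
            raise
        self.failures = 0                 # success closes the circuit
        return result
```

Failing fast is the point: without the breaker, every caller waits out the full timeout against a dead dependency, and the queue-up cascade described earlier begins.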

Deployment Monitoring

Every deployment is a potential incident. Monitor deployments with:

  1. Canary deployments — route 5% of traffic to the new version, compare error rates
  2. Deployment markers — annotate your metrics dashboard with deployment timestamps
  3. Automatic rollback — if error rate increases > 2x within 10 minutes, roll back
  4. Health check gates — new instances must pass health checks before receiving traffic
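The automatic-rollback rule in step 3 reduces to comparing canary and baseline error rates. A sketch (the 2x factor follows the text; the minimum-traffic guard is my added assumption, to avoid deciding on noise from a handful of requests):

```python
# Sketch: decide whether to roll back a canary. The "error rate > 2x
# baseline" rule follows the text; the minimum-request guard is an added
# assumption so the decision isn't made on statistical noise.

def should_rollback(baseline_errors, baseline_total,
                    canary_errors, canary_total,
                    factor=2.0, min_requests=100):
    if canary_total < min_requests:
        return False                       # not enough canary traffic yet
    baseline_rate = baseline_errors / max(baseline_total, 1)
    canary_rate = canary_errors / canary_total
    return canary_rate > factor * baseline_rate
```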

Recommended Monitoring Stack

Layer        Tool                    Purpose
Uptime       Valpero                 External availability + SSL + multi-region
Metrics      Prometheus + Grafana    Internal service metrics
Tracing      Jaeger / Tempo          Distributed request tracing
Logging      Loki / ELK              Centralized structured logs
Alerting     Valpero + Alertmanager  Symptom-based alerts
Status Page  Valpero                 Customer-facing incident communication

Common Pitfalls

  1. Monitoring too much — dashboards with 50 panels that nobody looks at
  2. Monitoring too little — "it was fine in staging" syndrome
  3. No runbooks — alerts fire but nobody knows what to do
  4. Ignoring dependencies — your service is healthy but the upstream API is down
  5. No SLOs — without a target, you can't measure if monitoring is working

Conclusion

Microservice monitoring isn't about watching individual services — it's about understanding the system as a whole. Start with health checks and external monitoring, add metrics and alerting as you grow, and invest in tracing when debugging becomes painful. The goal is always the same: know about problems before your users do.