The Complete Guide to Monitoring Microservices in Production

Microservices solved the monolith scaling problem — and created a monitoring nightmare. Instead of one application to watch, you now have dozens of services, each with its own database, queue, and failure modes. Here's how to get observability right.

Why Microservice Monitoring Is Different

In a monolith, a failing component usually takes down the whole application. It's obvious something is wrong. In a microservices architecture, failures are subtle:

  • Cascading failures — Service A slows down, causing Service B to time out, causing Service C to queue up, causing the user to see a spinner
  • Partial degradation — The checkout works but recommendations are broken. Is that an incident?
  • Network complexity — Services communicate over the network, adding latency, retry storms, and connection pool exhaustion
  • Distributed state — Data is spread across multiple databases, making consistency issues hard to diagnose

The Four Pillars of Microservice Observability

1. Health Checks

Every service should expose a health endpoint that your monitoring system can check:

GET /health
{
  "status": "healthy",
  "version": "2.4.1",
  "uptime": "48h32m",
  "dependencies": {
    "database": "ok",
    "redis": "ok",
    "payment-service": "ok"
  }
}

Best practices:

  • Check downstream dependencies, not just the service itself
  • Include version info for deployment tracking
  • Use three states: healthy, degraded, unhealthy
  • Set appropriate timeouts (a health check shouldn't take > 5 seconds)
  • Don't cache health check responses
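As an illustration, the three-state aggregation can be sketched as a plain function; the dependency names and the "critical" flags here are assumptions for the example, not part of any specific framework:

```python
# Sketch: aggregate per-dependency checks into healthy / degraded / unhealthy.
# `checks` maps a dependency name -> (passed, critical); the names and
# criticality flags are illustrative assumptions.

def aggregate_health(checks: dict) -> dict:
    """healthy: everything passed; unhealthy: a critical dependency failed;
    degraded: only non-critical dependencies failed."""
    deps = {name: "ok" if passed else "failing"
            for name, (passed, _critical) in checks.items()}
    if all(passed for passed, _ in checks.values()):
        status = "healthy"
    elif any(not passed and critical for passed, critical in checks.values()):
        status = "unhealthy"
    else:
        status = "degraded"
    return {"status": status, "dependencies": deps}
```

A health endpoint handler would run the real checks (with timeouts) and serialize this result alongside version and uptime info.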

2. Metrics

Collect quantitative data about your service behavior:

The RED Method (for request-driven services):

  • Rate — requests per second
  • Errors — failed requests per second
  • Duration — response time distribution (p50, p95, p99)

The USE Method (for infrastructure):

  • Utilization — how busy the resource is (CPU, memory, disk)
  • Saturation — how much work is queued
  • Errors — error count

Key metrics to track per service:

| Metric | Why It Matters |
|--------|----------------|
| Request rate | Traffic patterns, capacity planning |
| Error rate | Service health, deployment impact |
| p99 latency | Worst-case user experience |
| Connection pool usage | Resource exhaustion risk |
| Queue depth | Backpressure, processing lag |
| Circuit breaker state | Downstream dependency health |
| Memory / CPU usage | Resource constraints |
| GC pause time | JVM/Go runtime health |
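To make RED concrete, here is a minimal sketch that derives rate, error rate, and latency percentiles from a window of request records. The record shape (status code, duration in ms) and the nearest-rank percentile method are assumptions for illustration; real systems compute this from histogram buckets:

```python
# Sketch: compute RED metrics from a window of (status_code, duration_ms)
# request records. Record shape and window length are illustrative.

def red_metrics(requests, window_seconds):
    durations = sorted(d for _, d in requests)
    errors = sum(1 for status, _ in requests if status >= 500)

    def percentile(p):
        # nearest-rank percentile over the sorted durations
        idx = max(0, int(round(p / 100 * len(durations))) - 1)
        return durations[idx]

    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rps": errors / window_seconds,
        "p50_ms": percentile(50),
        "p95_ms": percentile(95),
        "p99_ms": percentile(99),
    }
```

Note how p99 is reported alongside p50: averages hide the tail, and the tail is what your slowest users experience.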

3. Distributed Tracing

When a user request touches 5 services, you need to follow that request across the entire chain:

User Request → API Gateway → Auth Service → Product Service → Inventory DB → Response
                                    ↓
                              Audit Logger

A trace ID propagated through all services lets you:

  • See the complete request flow
  • Identify which service added latency
  • Find the root cause of errors in complex call chains
  • Understand dependency relationships

Implementation:

  • Use OpenTelemetry (vendor-neutral standard)
  • Propagate trace context via HTTP headers (traceparent)
  • Sample traces at 1–10% in production (100% is too expensive)
  • Store traces for at least 7 days
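A minimal sketch of how the traceparent header works, following the W3C Trace Context layout (version-traceid-spanid-flags). In practice the OpenTelemetry SDK does this for you; the helper names here are mine:

```python
import secrets

# Sketch: generate and propagate a W3C traceparent header.
# Layout: version(2 hex)-trace_id(32 hex)-parent_span_id(16 hex)-flags(2 hex)

def new_traceparent(sampled: bool = True) -> str:
    trace_id = secrets.token_hex(16)   # 32 hex chars, shared by the whole request
    span_id = secrets.token_hex(8)     # 16 hex chars, unique per hop
    flags = "01" if sampled else "00"  # sampling decision travels with the trace
    return f"00-{trace_id}-{span_id}-{flags}"

def next_hop(traceparent: str) -> str:
    """Keep the trace ID, mint a new span ID for the outgoing call."""
    version, trace_id, _parent, flags = traceparent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

The key property: the trace ID never changes across hops, so every service's spans and logs can be joined on it, while each hop gets its own span ID.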

4. Structured Logging

Logs are your last resort when metrics and traces don't explain the problem:

{
  "timestamp": "2026-04-07T14:32:01Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123",
  "message": "Payment declined",
  "user_id": "u_456",
  "amount": 99.99,
  "error_code": "INSUFFICIENT_FUNDS",
  "duration_ms": 234
}

Best practices:

  • Use structured (JSON) logging, not free-text
  • Include trace ID in every log entry
  • Log at the right level (don't log every successful request at INFO)
  • Centralize logs (ELK stack, Loki, CloudWatch)
  • Set retention policies (7 days hot, 30 days warm, 90 days cold)
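An entry like the one above can be produced by a tiny wrapper, sketched here without any logging library; the field names follow the example, and the wrapper itself is illustrative:

```python
import json
import sys
from datetime import datetime, timezone

# Sketch: a minimal structured (JSON) logger that stamps every entry with
# service name and trace ID. Field names follow the example above.

def make_logger(service: str, trace_id: str, stream=sys.stdout):
    def log(level: str, message: str, **fields):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "level": level,
            "service": service,
            "trace_id": trace_id,
            "message": message,
            **fields,          # arbitrary context: user_id, amount, error_code...
        }
        stream.write(json.dumps(entry) + "\n")
    return log
```

Because the trace ID is baked in at logger creation, every entry for a request is joinable against its trace without each call site remembering to pass it.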

Alert Strategy for Microservices

The Problem with Naive Alerting

If you alert on every service independently, a single downstream failure triggers alerts from every upstream service. Your on-call engineer gets 47 alerts instead of 1.

Better Approach: Symptom-Based Alerting

Alert on user-facing symptoms, not individual service metrics:

Alert On                            Instead Of
Checkout error rate > 1%            Payment service p99 > 2s
Search results empty for > 5 min    Elasticsearch cluster yellow
Login success rate < 95%            Auth service CPU > 80%
API p99 > 3s for 5 minutes          Any individual service slow
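A symptom-based rule like "checkout error rate > 1%" evaluates one user-facing ratio, regardless of which downstream service is actually at fault. A sketch (the threshold mirrors the table; the record shape is an assumption):

```python
# Sketch: evaluate a symptom-based rule over a window of checkout outcomes.
# The 1% threshold mirrors the table above; the record shape is illustrative.

def checkout_alert(outcomes, threshold: float = 0.01) -> bool:
    """outcomes: list of bools, True = checkout succeeded.
    Fires when the failure ratio exceeds the threshold."""
    if not outcomes:
        return False          # no traffic is a different signal, not this one
    failures = outcomes.count(False)
    return failures / len(outcomes) > threshold
```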

Alert Levels

Level        Criteria                            Action
P1 Critical  Revenue-impacting, data loss risk   Page on-call immediately
P2 High      User-facing degradation             Notify team, fix within 1 hour
P3 Medium    Internal tooling, non-critical      Fix during business hours
P4 Low       Cosmetic, tech debt                 Add to backlog

Reducing Alert Fatigue

  • Group related alerts — one incident, one notification
  • Use alert cooldowns — don't re-alert for the same issue within 15 minutes
  • Require confirmation from multiple regions — eliminate network-related false positives
  • Review alerts monthly — delete alerts nobody acts on
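Cooldowns and grouping can be sketched as a small dedup gate: one key per incident, one notification per cooldown window. The 15-minute default follows the list above; the keying scheme is an assumption:

```python
import time

# Sketch: suppress repeat notifications for the same incident key within a
# cooldown window. The 15-minute default follows the text above.

class AlertGate:
    def __init__(self, cooldown_seconds: float = 15 * 60):
        self.cooldown = cooldown_seconds
        self.last_fired = {}              # incident_key -> last notify time

    def should_notify(self, incident_key: str, now=None) -> bool:
        now = time.time() if now is None else now
        last = self.last_fired.get(incident_key)
        if last is not None and now - last < self.cooldown:
            return False                  # same incident, still cooling down
        self.last_fired[incident_key] = now
        return True
```

Grouping happens in the choice of `incident_key`: key on the symptom ("checkout-errors"), not on the individual service, and 47 upstream alerts collapse into one notification.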

Service Dependency Mapping

You can't monitor what you don't understand. Map your service dependencies:

┌─────────┐     ┌──────────┐     ┌───────────┐
│   API   │────→│  Auth    │────→│  User DB  │
│ Gateway │     │ Service  │     └───────────┘
└────┬────┘     └──────────┘
     │
     ├──────→ Product Service ──→ Product DB
     │              │
     │              └──→ Search (Elasticsearch)
     │
     └──────→ Order Service ──→ Order DB
                    │
                    ├──→ Payment Service ──→ Stripe API
                    └──→ Notification Service ──→ Email/SMS

For each dependency, document:

  • Is it critical (service can't function without it) or degradable (can fall back)?
  • What's the timeout and retry policy?
  • Is there a circuit breaker?
  • What's the SLA?
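Since the checklist asks "is there a circuit breaker?", here is a minimal sketch of one. The thresholds and state handling are illustrative; production services typically get this from a library or a service mesh rather than hand-rolling it:

```python
import time

# Sketch: a minimal circuit breaker. After `max_failures` consecutive
# failures the circuit opens and calls fail fast; after `reset_seconds`
# it lets one trial call through (half-open).

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_seconds=30.0):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, now=None):
        now = time.time() if now is None else now
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None         # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now
            raise
        self.failures = 0                 # success closes the circuit
        return result
```

Failing fast is the point: without the breaker, every caller waits out the full timeout against a dead dependency, and the queue-up cascade described earlier begins.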

Deployment Monitoring

Every deployment is a potential incident. Monitor deployments with:

  1. Canary deployments — route 5% of traffic to the new version, compare error rates
  2. Deployment markers — annotate your metrics dashboard with deployment timestamps
  3. Automatic rollback — if error rate increases > 2x within 10 minutes, roll back
  4. Health check gates — new instances must pass health checks before receiving traffic
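The automatic-rollback rule in step 3 reduces to comparing canary and baseline error rates. A sketch (the 2x factor follows the text; the minimum-traffic guard is my added assumption, to avoid deciding on noise from a handful of requests):

```python
# Sketch: decide whether to roll back a canary. The "error rate > 2x
# baseline" rule follows the text; the minimum-request guard is an added
# assumption so the decision isn't made on statistical noise.

def should_rollback(baseline_errors, baseline_total,
                    canary_errors, canary_total,
                    factor=2.0, min_requests=100):
    if canary_total < min_requests:
        return False                       # not enough canary traffic yet
    baseline_rate = baseline_errors / max(baseline_total, 1)
    canary_rate = canary_errors / canary_total
    return canary_rate > factor * baseline_rate
```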

Recommended Monitoring Stack

Layer        Tool                    Purpose
Uptime       Valpero                 External availability + SSL + multi-region
Metrics      Prometheus + Grafana    Internal service metrics
Tracing      Jaeger / Tempo          Distributed request tracing
Logging      Loki / ELK              Centralized structured logs
Alerting     Valpero + Alertmanager  Symptom-based alerts
Status Page  Valpero                 Customer-facing incident communication

Common Pitfalls

  1. Monitoring too much — dashboards with 50 panels that nobody looks at
  2. Monitoring too little — "it was fine in staging" syndrome
  3. No runbooks — alerts fire but nobody knows what to do
  4. Ignoring dependencies — your service is healthy but the upstream API is down
  5. No SLOs — without a target, you can't measure if monitoring is working

Conclusion

Microservice monitoring isn't about watching individual services — it's about understanding the system as a whole. Start with health checks and external monitoring, add metrics and alerting as you grow, and invest in tracing when debugging becomes painful. The goal is always the same: know about problems before your users do.