How to Monitor Your SaaS Application: A Complete Guide

Running a SaaS application means your users expect 24/7 availability. Unlike a downloadable app that runs on the user's machine, every bug, every slow query, and every infrastructure hiccup directly affects your customers — and your revenue. Here's how to build monitoring that keeps your SaaS reliable.

The SaaS Monitoring Stack

A production SaaS application needs monitoring at five layers:

Layer 1: External Availability (User Perspective)

This is where you start. Can your users reach your application?

What to monitor:
  • Main application URL (login page, dashboard)
  • API endpoints (the ones your SPA or mobile app calls)
  • Public pages (marketing site, docs, blog)
  • CDN-served assets (JS bundles, images)
  • Third-party dependencies (payment gateway, auth provider)

How to monitor:
  • External HTTP checks every 1-5 minutes
  • From multiple geographic regions (your users aren't all in one city)
  • With keyword validation (catch error pages that return 200)
  • SSL certificate expiry monitoring

Alert on: Downtime from 2+ regions for 2+ consecutive checks.
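The alert rule and the keyword check above can be sketched in a few lines of Python. This is only an illustration: `check_passed` and `RegionalAlerter` are hypothetical names, and the sketch assumes some external prober already supplies the status code, response body, and region for each check.

```python
from collections import defaultdict


def check_passed(status_code, body, keyword):
    """A check passes only if the response is 2xx AND contains the expected
    keyword, which catches error pages served with a 200 status."""
    return 200 <= status_code < 300 and keyword in body


class RegionalAlerter:
    """Applies the 'down from 2+ regions for 2+ consecutive checks' rule."""

    def __init__(self, min_regions=2, min_consecutive=2):
        self.min_regions = min_regions
        self.min_consecutive = min_consecutive
        # region name -> number of consecutive failed checks
        self._failures = defaultdict(int)

    def record(self, region, ok):
        """Record one check result; a success resets that region's streak."""
        if ok:
            self._failures[region] = 0
        else:
            self._failures[region] += 1

    def should_alert(self):
        """True when enough regions each have a long enough failure streak."""
        failing = [
            region
            for region, streak in self._failures.items()
            if streak >= self.min_consecutive
        ]
        return len(failing) >= self.min_regions
```

Requiring agreement across regions and across consecutive checks trades a few minutes of detection delay for far fewer false alarms caused by a single flaky probe.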

Layer 2: Application Performance (User Experience)

Your app might be "up" but painfully slow. Users don't distinguish between "down" and "too slow to use."

What to monitor:
  • Response time (p50, p95, p99) per endpoint
  • Error rate (5xx responses / total responses)
  • Core Web Vitals (LCP, CLS, INP)
  • API-specific metrics (authentication success rate, search latency)
  • Background job processing time and queue depth

How to monitor:
  • Application Performance Monitoring (APM) instrumentation
  • Server-side request logging with timing
  • Client-side Real User Monitoring (RUM) via JS snippet
  • Synthetic monitoring for critical user journeys

Alert on: p99 > 3 seconds for 5 minutes, error rate > 1%.
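To make the percentile and error-rate thresholds concrete, here is a minimal sketch. It assumes you have already collected the latencies (in seconds) and status codes for the last 5-minute window; the function names are illustrative, and the percentile uses the simple nearest-rank method rather than whatever your APM tool does internally.

```python
def percentile(samples, pct):
    """Nearest-rank percentile. `pct` is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding float rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def error_rate(status_codes):
    """Share of responses that are 5xx (5xx responses / total responses)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def breaches(latencies_s, status_codes, p99_limit_s=3.0, error_limit=0.01):
    """True if the window violates either threshold from the alert rule."""
    too_slow = percentile(latencies_s, 99) > p99_limit_s
    too_many_errors = error_rate(status_codes) > error_limit
    return too_slow or too_many_errors
```

Alerting on p99 rather than the average matters because a small share of very slow requests can hide behind a healthy-looking mean.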

Layer 3: Infrastructure (System Resources)

The servers, databases, and services running your app.

What to monitor:
  • CPU usage (sustained > 80% = problem)
  • Memory usage (approaching limit = OOM risk)
  • Disk usage (> 85% = time to clean up or scale)
  • Network I/O (bandwidth saturation)
  • Database connections (pool exhaustion)
  • Database query performance (slow query log)
  • Cache hit rate (Redis/Memcached)
  • Queue depth and processing rate

How to monitor:
  • Server monitoring agent (installed on each instance)
  • Database-specific monitoring (pg_stat_statements, slow query log)
  • Cloud provider metrics (CloudWatch, GCP Monitoring)

Alert on: Resource usage > 80% sustained for 10+ minutes.
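The word "sustained" does the heavy lifting in that rule: a single spike should not page anyone. One way to express it, assuming samples arrive as (timestamp in seconds, value in percent) pairs sorted by time, is the sketch below. The function name and sample format are assumptions for illustration.

```python
def sustained_above(samples, threshold=80.0, min_duration_s=600):
    """Return True if the most recent unbroken run of samples above
    `threshold` spans at least `min_duration_s` seconds.

    `samples` is a list of (timestamp_seconds, value) tuples sorted by time.
    """
    run_start = None
    for timestamp, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = timestamp  # a new run above the threshold begins
        else:
            run_start = None  # any dip below the threshold resets the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start >= min_duration_s
```

With one sample per minute, eleven consecutive readings above 80% cover the 10 minutes needed to trigger the alert.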

Layer 4: Business Metrics (Revenue Impact)

Technical metrics don't tell you if the business is working.

What to monitor:
  • Signup conversion rate (landing page → registered)
  • Activation rate (registered → first meaningful action)
  • Payment success rate (attempted → completed)
  • Churn indicators (login frequency dropping)
  • Feature usage (are new features being adopted?)

How to monitor:
  • Analytics (Mixpanel, PostHog, or custom events)
  • Database queries on subscription/payment tables
  • Funnel tracking in your product

Alert on: Payment success rate drops below 95%, signup rate drops > 50% from baseline.
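Both thresholds are simple ratios once you have the counts. The sketch below assumes the counts come from your own payment and signup tables for matching time windows; the function names and alert labels are made up for the example.

```python
def payment_success_rate(attempted, completed):
    """Completed payments divided by attempted; None when there is no traffic."""
    if attempted == 0:
        return None
    return completed / attempted


def signup_drop(current, baseline):
    """Fractional drop in signups relative to the baseline (0.0 if no drop)."""
    if baseline <= 0:
        return 0.0
    return max(0.0, (baseline - current) / baseline)


def business_alerts(attempted, completed, signups_now, signups_baseline):
    """Return the names of the business alerts that should fire."""
    alerts = []
    rate = payment_success_rate(attempted, completed)
    if rate is not None and rate < 0.95:
        alerts.append("payment_success_rate")
    if signup_drop(signups_now, signups_baseline) > 0.50:
        alerts.append("signup_rate")
    return alerts
```

Returning None for zero payment attempts avoids a false alarm during quiet hours, although a long stretch with no attempts at all may deserve its own check.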

Layer 5: Security Monitoring

SaaS applications are high-value targets.

What to monitor:
  • Failed login attempts (brute force detection)
  • Unusual API usage patterns (rate limit violations)
  • SSL certificate changes (detect hijacking)
  • Dependency vulnerabilities (npm audit, pip-audit)
  • DNS record changes (unauthorized modifications)

How to monitor:
  • Application-level rate limiting with logging
  • DNS monitoring for record changes
  • Safe Browsing status checks
  • Regular dependency audits

Alert on: 100+ failed logins from one IP, DNS record change, SSL certificate mismatch.
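The failed-login rule needs a time window to be meaningful, which the rule above does not specify. The sketch below assumes a one-hour sliding window, which is purely an illustrative choice, and keeps everything in memory; a real deployment would more likely use a shared store. `FailedLoginTracker` is a hypothetical name.

```python
from collections import defaultdict, deque


class FailedLoginTracker:
    """Counts failed logins per IP inside a sliding time window."""

    def __init__(self, threshold=100, window_s=3600):
        self.threshold = threshold
        self.window_s = window_s  # assumed window; tune for your traffic
        # ip -> timestamps (seconds) of recent failures, oldest first
        self._events = defaultdict(deque)

    def record_failure(self, ip, timestamp):
        """Record a failed login and return True if this IP should alert."""
        events = self._events[ip]
        events.append(timestamp)
        # Drop failures that have aged out of the window.
        while events and events[0] <= timestamp - self.window_s:
            events.popleft()
        return len(events) >= self.threshold
```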

Setting SLOs for Your SaaS

What Is an SLO?

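A Service Level Objective (SLO) is your internal target for reliability. It is stricter than your Service Level Agreement (SLA), the promise you make to customers, so that you notice trouble before the contractual commitment is at risk.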

Metric              SLO        SLA
Availability        99.95%     99.9%
API latency (p95)   < 500ms    < 2 seconds
API error rate      < 0.1%     < 1%

How to Calculate Error Budget

Error budget = 1 - SLO

If SLO = 99.95%:
Error budget = 0.05% of a 30-day month (43,200 minutes) = 21.6 minutes/month

You can "spend" 21.6 minutes of downtime per month
before you breach your SLO.
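The same arithmetic as a small Python sketch, assuming a 30-day month and an SLO expressed as a percentage. The function names are illustrative, and results are floats, so expect roughly 21.6 rather than an exact decimal.

```python
def error_budget_minutes(slo_percent, days=30):
    """Allowed downtime in minutes for the period: (1 - SLO) * total minutes."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent, downtime_minutes, days=30):
    """Fraction of the error budget still unspent (negative once breached)."""
    budget = error_budget_minutes(slo_percent, days)
    return (budget - downtime_minutes) / budget
```

For example, `error_budget_minutes(99.95)` comes out at about 21.6 minutes, and after 10.8 minutes of downtime `budget_remaining(99.95, 10.8)` is about one half.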

When your error budget is running low:
  • Freeze non-critical deployments
  • Focus engineering effort on reliability
  • Postpone risky changes

Monitoring Your Tech Stack

Frontend (React, Vue, Next.js)

  • Core Web Vitals (LCP, CLS, INP)
  • JavaScript error tracking (Sentry, LogRocket)
  • Bundle size monitoring (Webpack bundle analyzer)
  • CDN cache hit rate

Backend API (Node.js, Python, Go)

  • Request rate, error rate, duration (RED metrics)
  • Endpoint-level performance breakdown
  • Database query timing
  • External API call timing

Database (PostgreSQL, MySQL)

  • Active connections vs pool size
  • Slow queries (> 100ms)
  • Replication lag (if using replicas)
  • Table and index sizes (bloat detection)
  • Deadlock count

Cache (Redis, Memcached)

  • Hit rate (target: > 95%)
  • Memory usage vs max memory
  • Eviction rate
  • Connection count
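For Redis, hit rate can be derived from the `keyspace_hits` and `keyspace_misses` counters reported by the INFO command. The sketch below assumes those stats have already been fetched into a plain dictionary; the function names are illustrative. Because the counters are cumulative since server start, comparing two snapshots gives a more useful recent figure.

```python
def cache_hit_rate(stats):
    """Lifetime hit rate from cumulative counters; None if no lookups yet."""
    hits = stats.get("keyspace_hits", 0)
    misses = stats.get("keyspace_misses", 0)
    total = hits + misses
    return hits / total if total else None


def window_hit_rate(previous, current):
    """Hit rate between two snapshots of the cumulative counters."""
    hits = current["keyspace_hits"] - previous["keyspace_hits"]
    misses = current["keyspace_misses"] - previous["keyspace_misses"]
    total = hits + misses
    return hits / total if total > 0 else None


def below_target(rate, target=0.95):
    """True when a known hit rate falls under the 95% target."""
    return rate is not None and rate < target
```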

Queue (RabbitMQ, SQS, Celery)

  • Queue depth (growing = processing can't keep up)
  • Processing time per message
  • Dead letter queue size
  • Consumer count
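"Growing" queue depth needs a definition before it can trigger an alert. One simple interpretation, offered here as an assumption rather than a standard, is that depth never decreases across the last few samples and ends higher than it started.

```python
def queue_is_growing(depths, min_samples=5):
    """True if the last `min_samples` depth readings never decrease
    and the newest reading is higher than the oldest one."""
    recent = depths[-min_samples:]
    if len(recent) < min_samples:
        return False  # not enough data to judge a trend
    never_decreases = all(b >= a for a, b in zip(recent, recent[1:]))
    return never_decreases and recent[-1] > recent[0]
```

This deliberately ignores queues that are large but stable; pair it with an absolute depth threshold if a big steady backlog is also a problem for you.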

Incident Classification for SaaS

Severity   Criteria                               Response Time       Example
SEV1       All users affected, revenue impacted   Immediate           App fully down
SEV2       Subset affected, core flow broken      15 minutes          Payments failing
SEV3       Non-core feature broken                1 hour              Reporting broken
SEV4       Minor issue, workaround exists         Next business day   UI cosmetic bug
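If you want alerts or incident tooling to assign a severity automatically, the table can be encoded as a small decision function. The three boolean inputs below are one possible reading of the criteria column, not a definitive mapping, and `classify` is a hypothetical helper.

```python
from enum import Enum


class Severity(Enum):
    """Severity levels, with the target response time as the value."""
    SEV1 = "Immediate"
    SEV2 = "15 minutes"
    SEV3 = "1 hour"
    SEV4 = "Next business day"


def classify(all_users_affected, core_flow_broken, workaround_exists):
    """Pick the highest severity whose criteria match, checked top down."""
    if all_users_affected:
        return Severity.SEV1
    if core_flow_broken:
        return Severity.SEV2
    if not workaround_exists:
        return Severity.SEV3
    return Severity.SEV4
```

For instance, failing payments for a subset of users would be `classify(False, True, False)`, which yields SEV2 with a 15-minute response target.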

The Minimum Viable Monitoring Stack

Starting from zero? Here's what to set up in order:

  1. External uptime monitoring (Day 1)
     • Main URL + API + payment endpoint
     • Multi-region checks every 5 minutes
     • Email + Telegram/Slack alerts

  2. Application error tracking (Week 1)
     • Sentry or equivalent for exception tracking
     • Source maps for meaningful stack traces

  3. Server monitoring (Week 1)
     • Agent on each server for CPU/RAM/disk
     • Database connection monitoring

  4. Status page (Week 1)
     • Public page for customer communication
     • Linked from footer and support docs

  5. On-call rotation (Month 1)
     • At least 2 people rotating weekly
     • Escalation chain documented

  6. Performance monitoring (Month 1-2)
     • APM instrumentation for latency tracking
     • Slow query logging
     • Core Web Vitals tracking

  7. Business metric monitoring (Month 2-3)
     • Payment success rate tracking
     • Signup funnel monitoring
     • Key feature usage metrics

Conclusion

Monitoring a SaaS application is not a one-time setup — it's an ongoing practice that grows with your product. Start with external uptime monitoring (the cheapest, highest-impact investment), then layer in application performance, infrastructure metrics, and business KPIs as your team and product mature. The goal is always the same: know about problems before your customers do, fix them fast, and learn from every incident to prevent the next one.