How to Monitor Your SaaS Application: A Complete Guide
Running a SaaS application means your users expect 24/7 availability. Unlike a downloadable app that runs on the user's machine, a SaaS product runs on infrastructure you operate, so every bug, every slow query, and every infrastructure hiccup directly affects your customers and your revenue. Here's how to build monitoring that keeps your SaaS reliable.
The SaaS Monitoring Stack
A production SaaS application needs monitoring at five layers:
Layer 1: External Availability (User Perspective)
This is where you start. Can your users reach your application?
What to monitor:
- Main application URL (login page, dashboard)
- API endpoints (the ones your SPA or mobile app calls)
- Public pages (marketing site, docs, blog)
- CDN-served assets (JS bundles, images)
- Third-party dependencies (payment gateway, auth provider)

How to monitor:
- External HTTP checks every 1-5 minutes
- From multiple geographic regions (your users aren't all in one city)
- With keyword validation (catch error pages that return 200)
- SSL certificate expiry monitoring
Alert on: Downtime from 2+ regions for 2+ consecutive checks.
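The alert rule above and the keyword-validation idea can be sketched as two small pure functions. This is a minimal illustration, not a reference implementation: the function names, the region labels, and the shape of the `history` dictionary are assumptions made for the example, and the actual HTTP requests are left to whatever checker you use.

```python
from typing import Dict, List


def check_passed(status_code: int, body: str, keyword: str) -> bool:
    """Keyword validation: a 200 only counts if the expected text is in the body."""
    return status_code == 200 and keyword in body


def should_alert(history: Dict[str, List[bool]],
                 min_regions: int = 2,
                 min_consecutive: int = 2) -> bool:
    """Return True when enough regions failed their most recent checks.

    history maps a region name to its check results, oldest first.
    True means the check passed, False means it failed.
    """
    failing_regions = 0
    for results in history.values():
        recent = results[-min_consecutive:]
        # A region counts only if it has enough samples and all of them failed.
        if len(recent) == min_consecutive and not any(recent):
            failing_regions += 1
    return failing_regions >= min_regions
```

Requiring agreement between regions and between consecutive checks is what filters out a single flaky probe or a short network blip in one location.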
Layer 2: Application Performance (User Experience)
Your app might be "up" but painfully slow. Users don't distinguish between "down" and "too slow to use."
What to monitor:
- Response time (p50, p95, p99) per endpoint
- Error rate (5xx responses / total responses)
- Core Web Vitals (LCP, CLS, INP)
- API-specific metrics (authentication success rate, search latency)
- Background job processing time and queue depth

How to monitor:
- Application Performance Monitoring (APM) instrumentation
- Server-side request logging with timing
- Client-side Real User Monitoring (RUM) via JS snippet
- Synthetic monitoring for critical user journeys
Alert on: p99 > 3 seconds for 5 minutes, error rate > 1%.
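To make the percentile and error-rate thresholds concrete, here is a small sketch that evaluates one window of request samples. It assumes the caller passes in only the requests from the last 5 minutes, and it uses the nearest-rank percentile definition; APM tools often use interpolated or histogram-based estimates instead, so treat the exact numbers as illustrative. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(status_codes: Sequence[int]) -> float:
    """Share of responses with a 5xx status code."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def latency_alerts(durations_s: Sequence[float],
                   status_codes: Sequence[int],
                   p99_limit_s: float = 3.0,
                   error_limit: float = 0.01) -> List[str]:
    """Evaluate one window of samples (assumed: the last 5 minutes)."""
    alerts = []
    if durations_s and percentile(durations_s, 99) > p99_limit_s:
        alerts.append("p99 latency above limit")
    if error_rate(status_codes) > error_limit:
        alerts.append("error rate above limit")
    return alerts
```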
Layer 3: Infrastructure (System Resources)
The servers, databases, and services running your app.
What to monitor:
- CPU usage (sustained > 80% = problem)
- Memory usage (approaching limit = OOM risk)
- Disk usage (> 85% = time to clean up or scale)
- Network I/O (bandwidth saturation)
- Database connections (pool exhaustion)
- Database query performance (slow query log)
- Cache hit rate (Redis/Memcached)
- Queue depth and processing rate

How to monitor:
- Server monitoring agent (installed on each instance)
- Database-specific monitoring (pg_stat_statements, slow query log)
- Cloud provider metrics (CloudWatch, GCP Monitoring)
Alert on: Resource usage > 80% sustained for 10+ minutes.
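The word "sustained" matters here: a one-minute CPU spike should not page anyone. A minimal sketch of that check, assuming the agent reports one usage percentage per fixed interval (60 seconds in this example) and that the function name and parameters are illustrative:

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float],
                     threshold: float = 80.0,
                     interval_s: int = 60,
                     duration_s: int = 600) -> bool:
    """True if the newest samples span duration_s and every one exceeds threshold.

    samples holds usage percentages, oldest first, one per interval_s seconds.
    """
    needed = -(-duration_s // interval_s)  # ceiling division
    if len(samples) < needed:
        return False  # not enough history to call it sustained
    return all(value > threshold for value in samples[-needed:])
```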
Layer 4: Business Metrics (Revenue Impact)
Technical metrics don't tell you if the business is working.
What to monitor:
- Signup conversion rate (landing page → registered)
- Activation rate (registered → first meaningful action)
- Payment success rate (attempted → completed)
- Churn indicators (login frequency dropping)
- Feature usage (are new features being adopted?)

How to monitor:
- Analytics (Mixpanel, PostHog, or custom events)
- Database queries on subscription/payment tables
- Funnel tracking in your product
Alert on: Payment success rate drops below 95%, signup rate drops > 50% from baseline.
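Both business alerts reduce to simple ratios once you have the counts. The sketch below assumes you already pull `completed`, `attempted`, and signup counts from your analytics tool or database; the function names and the idea of comparing against a single baseline number are assumptions for illustration.

```python
from typing import List, Optional


def payment_success_rate(completed: int, attempted: int) -> Optional[float]:
    """Completed / attempted, or None when there were no attempts to judge."""
    if attempted == 0:
        return None
    return completed / attempted


def business_alerts(completed: int, attempted: int,
                    signups: int, baseline_signups: int,
                    min_success: float = 0.95,
                    max_drop: float = 0.50) -> List[str]:
    """Check one period (for example, the last hour) against both thresholds."""
    alerts = []
    rate = payment_success_rate(completed, attempted)
    if rate is not None and rate < min_success:
        alerts.append("payment success rate below threshold")
    if baseline_signups > 0:
        drop = (baseline_signups - signups) / baseline_signups
        if drop > max_drop:
            alerts.append("signups dropped sharply from baseline")
    return alerts
```

With low traffic these ratios get noisy, so it may be worth requiring a minimum number of attempts before alerting.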
Layer 5: Security Monitoring
SaaS applications are high-value targets.
What to monitor:
- Failed login attempts (brute force detection)
- Unusual API usage patterns (rate limit violations)
- SSL certificate changes (detect hijacking)
- Dependency vulnerabilities (npm audit, pip-audit)
- DNS record changes (unauthorized modifications)

How to monitor:
- Application-level rate limiting with logging
- DNS monitoring for record changes
- Safe Browsing status checks
- Regular dependency audits
Alert on: 100+ failed logins from one IP, DNS record change, SSL certificate mismatch.
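The failed-login rule can be sketched as a sliding-window counter per IP address. The text above gives no time window, so the one-hour default here is an assumption, as are the class and method names. A real deployment would likely keep this state in a shared store rather than in process memory.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class FailedLoginTracker:
    """Counts failed logins per IP inside a sliding time window."""

    def __init__(self, threshold: int = 100, window_s: float = 3600.0) -> None:
        self.threshold = threshold
        self.window_s = window_s  # assumed window; pick one that fits your traffic
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record_failure(self, ip: str, timestamp: float) -> bool:
        """Record one failed login; return True if this IP should trigger an alert."""
        events = self._events[ip]
        events.append(timestamp)
        # Drop failures that have aged out of the window.
        while events and events[0] <= timestamp - self.window_s:
            events.popleft()
        return len(events) >= self.threshold
```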
Setting SLOs for Your SaaS
What Is an SLO?
A Service Level Objective is your internal target for reliability. It's deliberately stricter than your SLA (the promise to customers), so that missing the SLO warns you before you break that promise.
| Metric | SLO | SLA |
|---|---|---|
| Availability | 99.95% | 99.9% |
| API latency (p95) | < 500ms | < 2 seconds |
| API error rate | < 0.1% | < 1% |
How to Calculate Error Budget
Error budget = 100% - SLO
If SLO = 99.95%:
Error budget = 0.05% of a 30-day month (43,200 minutes) = 21.6 minutes/month
You can "spend" 21.6 minutes of downtime per month before you breach your SLO.
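The same arithmetic as a short sketch, assuming a 30-day month (43,200 minutes) and hypothetical function names. The second function shows how much of the budget is left after some downtime, which is the number you would watch when deciding whether to slow down releases.

```python
def error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Allowed downtime in minutes for the period, given an availability SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget_minutes(slo_percent, days)
    if budget <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return (budget - downtime_minutes) / budget
```

For the table above, a 99.95% SLO works out to about 21.6 minutes and the 99.9% SLA to about 43.2 minutes per 30 days.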
When your error budget is running low:
- Freeze non-critical deployments
- Focus engineering effort on reliability
- Postpone risky changes
Monitoring Your Tech Stack
Frontend (React, Vue, Next.js)
- Core Web Vitals (LCP, CLS, INP)
- JavaScript error tracking (Sentry, LogRocket)
- Bundle size monitoring (Webpack bundle analyzer)
- CDN cache hit rate
Backend API (Node.js, Python, Go)
- Request rate, error rate, duration (RED metrics)
- Endpoint-level performance breakdown
- Database query timing
- External API call timing
Database (PostgreSQL, MySQL)
- Active connections vs pool size
- Slow queries (> 100ms)
- Replication lag (if using replicas)
- Table and index sizes (bloat detection)
- Deadlock count
Cache (Redis, Memcached)
- Hit rate (target: > 95%)
- Memory usage vs max memory
- Eviction rate
- Connection count
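For Redis, the hit rate can be derived from the `keyspace_hits` and `keyspace_misses` counters reported by the `INFO stats` command. The sketch below assumes those two values have already been read into a dictionary; the function names are illustrative. Because the counters are cumulative, a production check would more usefully compare two snapshots taken a few minutes apart.

```python
from typing import Mapping, Optional


def cache_hit_rate(stats: Mapping[str, int]) -> Optional[float]:
    """Hits / (hits + misses), or None if the cache has served no lookups yet."""
    hits = stats.get("keyspace_hits", 0)
    misses = stats.get("keyspace_misses", 0)
    total = hits + misses
    if total == 0:
        return None
    return hits / total


def hit_rate_ok(stats: Mapping[str, int], target: float = 0.95) -> bool:
    """True when the hit rate is above target (or there is no data to judge)."""
    rate = cache_hit_rate(stats)
    return rate is None or rate > target
```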
Queue (RabbitMQ, SQS, Celery)
- Queue depth (growing = processing can't keep up)
- Processing time per message
- Dead letter queue size
- Consumer count
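A growing queue is easier to alert on as a rate than as an absolute depth. A minimal sketch, assuming depth is sampled at a fixed interval and using hypothetical function names:

```python
from typing import Sequence


def depth_growth_per_minute(depths: Sequence[int], interval_s: float) -> float:
    """Average change in queue depth per minute across evenly spaced samples."""
    if len(depths) < 2:
        return 0.0
    elapsed_min = (len(depths) - 1) * interval_s / 60
    return (depths[-1] - depths[0]) / elapsed_min


def is_falling_behind(depths: Sequence[int], interval_s: float,
                      min_growth_per_minute: float = 0.0) -> bool:
    """True when the backlog grows faster than the tolerated rate."""
    return depth_growth_per_minute(depths, interval_s) > min_growth_per_minute
```

Comparing only the first and last sample keeps the example short; a noisy queue may need a longer window or a fitted trend.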
Incident Classification for SaaS
| Severity | Criteria | Response Time | Example |
|---|---|---|---|
| SEV1 | All users affected, revenue impacted | Immediate | App fully down |
| SEV2 | Subset affected, core flow broken | 15 minutes | Payments failing |
| SEV3 | Non-core feature broken | 1 hour | Reporting broken |
| SEV4 | Minor issue, workaround exists | Next business day | UI cosmetic bug |
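One possible way to encode the table so that alerts get a severity automatically is shown below. The three boolean inputs are a simplified reading of the criteria column, not a definitive rule set, and the names are assumptions for the example.

```python
RESPONSE_TIME = {
    "SEV1": "Immediate",
    "SEV2": "15 minutes",
    "SEV3": "1 hour",
    "SEV4": "Next business day",
}


def classify(all_users_affected: bool,
             core_flow_broken: bool,
             workaround_exists: bool) -> str:
    """Map incident facts to a severity, checking the worst case first."""
    if all_users_affected:
        return "SEV1"
    if core_flow_broken:
        return "SEV2"
    if not workaround_exists:
        return "SEV3"
    return "SEV4"
```

Looking up `RESPONSE_TIME[classify(...)]` then gives the expected response window for the incident.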
The Minimum Viable Monitoring Stack
Starting from zero? Here's what to set up in order:
1. External uptime monitoring (Day 1)
   - Main URL + API + payment endpoint
   - Multi-region checks every 5 minutes
   - Email + Telegram/Slack alerts
2. Application error tracking (Week 1)
   - Sentry or equivalent for exception tracking
   - Source maps for meaningful stack traces
3. Server monitoring (Week 1)
   - Agent on each server for CPU/RAM/disk
   - Database connection monitoring
4. Status page (Week 1)
   - Public page for customer communication
   - Linked from footer and support docs
5. On-call rotation (Month 1)
   - At least 2 people rotating weekly
   - Escalation chain documented
6. Performance monitoring (Month 1-2)
   - APM instrumentation for latency tracking
   - Slow query logging
   - Core Web Vitals tracking
7. Business metric monitoring (Month 2-3)
   - Payment success rate tracking
   - Signup funnel monitoring
   - Key feature usage metrics
Conclusion
Monitoring a SaaS application is not a one-time setup — it's an ongoing practice that grows with your product. Start with external uptime monitoring (the cheapest, highest-impact investment), then layer in application performance, infrastructure metrics, and business KPIs as your team and product mature. The goal is always the same: know about problems before your customers do, fix them fast, and learn from every incident to prevent the next one.