Your Cloud Provider Will Go Down: How to Prepare for AWS, GCP, and Azure Outages
It's not a question of if your cloud provider will have an outage — it's when. AWS us-east-1 went down in December 2021, taking hundreds of services with it. Google Cloud had a global networking outage in 2023. Azure's DNS failure in 2024 affected millions of users. If your entire infrastructure runs in one cloud, one region, or one availability zone, you're one failure away from a total outage.
Notable Cloud Provider Outages
| Provider | Date | Duration | Impact |
|---|---|---|---|
| AWS us-east-1 | Dec 2021 | ~5 hours | Netflix, Disney+, Slack, many others |
| Google Cloud | Aug 2023 | ~2 hours | Global networking, affected all services |
| Azure | Jan 2023 | ~15 hours | Multiple regions, auth services down |
| AWS S3 | Feb 2017 | ~4 hours | Broke half the internet |
| Cloudflare | Jun 2022 | ~1.5 hours | 19 data centers unreachable |
Key observation: These aren't small providers. They're the best in the world, and they still have outages. Your preparation determines whether their outage becomes your outage.
Understanding Cloud Failure Modes
Single Service Failure
One specific service fails (e.g., RDS, Lambda) while others continue working. Impact: Depends on your architecture. If you use that service, you're affected.
Availability Zone Failure
One or more data centers that make up a single availability zone lose power, network, or cooling. Impact: Workloads in that AZ go down. Multi-AZ workloads continue.
Regional Failure
An entire region (e.g., us-east-1) becomes unavailable. Impact: Everything in that region goes down. Cross-region failover needed.
Global Failure
A control plane or global service fails (IAM, DNS, the console itself). Impact: Cannot manage resources or authenticate. Existing workloads may continue.
Partial / Degraded
Services don't fully fail but become slow, intermittent, or return errors. Impact: The hardest to detect. Your monitoring may not trigger, but users suffer.
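Degraded states are easier to catch when a check looks at a window of results rather than a single up/down probe. The sketch below shows one way to do that, assuming probe results are collected as latency and success pairs. The `Sample` type, the `classify_health` name, and every threshold are illustrative assumptions to tune for your own traffic, not values from any provider.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Sample:
    latency_ms: float  # round-trip time of one probe
    ok: bool           # True if the probe got a successful response


def classify_health(samples, p95_limit_ms=800.0, error_limit=0.02, down_limit=0.5):
    """Classify a window of probe samples as unknown, down, degraded, or healthy."""
    if not samples:
        return "unknown"
    error_rate = sum(1 for s in samples if not s.ok) / len(samples)
    if error_rate >= down_limit:
        return "down"
    ok_latencies = [s.latency_ms for s in samples if s.ok]
    if len(ok_latencies) >= 2:
        # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
        p95 = quantiles(ok_latencies, n=20)[-1]
    elif ok_latencies:
        p95 = ok_latencies[0]
    else:
        p95 = 0.0
    # Elevated errors OR slow successful requests both count as degraded.
    if error_rate >= error_limit or p95 > p95_limit_ms:
        return "degraded"
    return "healthy"
```

A plain up/down check would report the "slow but succeeding" case as healthy; looking at tail latency and error rate together is what surfaces it.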
Preparation Strategies
Level 1: Multi-AZ (Within One Region)
Protects against: Single data center failure
Cost: Low (10-20% more than single AZ)
- Deploy application instances across 2-3 AZs
- Use multi-AZ database (RDS Multi-AZ, Cloud SQL HA)
- Load balancer distributes traffic across AZs
- Storage replicated across AZs (S3 Standard and regional GCS buckets are multi-AZ by default)
This is the minimum for production workloads.
Level 2: Multi-Region (Within One Cloud)
Protects against: Regional failure
Cost: Moderate (50-100% more)
- Active-passive: Primary region handles traffic, secondary is warm standby
- Active-active: Both regions serve traffic, data replicated between them
- DNS-based failover: Route53/Cloud DNS switches traffic on failure
Challenges:
- Data replication lag (consistency vs availability)
- Cross-region data transfer costs
- Testing failover regularly
- Stateful services (databases) are hardest to replicate
Level 3: Multi-Cloud
Protects against: Provider-wide outage, vendor lock-in
Cost: High (2x+ infrastructure + engineering time)
- Critical services deployed on 2+ cloud providers
- Cloud-agnostic tools (Kubernetes, Terraform)
- DNS-based traffic routing between clouds
- Data replicated across providers
Reality check: True multi-cloud is extremely expensive and complex. Most companies are better served by multi-region within one cloud + external monitoring.
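To put the three levels side by side, the rough cost ranges quoted above can be turned into a quick estimate. This is back-of-the-envelope arithmetic only: the percentages are the ballpark figures from this article, the baseline amount is a made-up example, and `estimate_monthly_cost` is a hypothetical helper name.

```python
# Extra infrastructure cost over a single-AZ baseline, in percent (low, high).
# Figures are this article's rough ranges; None means no upper bound was given.
OVERHEAD_PCT = {
    "multi-az": (10, 20),
    "multi-region": (50, 100),
    "multi-cloud": (100, None),  # "2x+", engineering time not included
}


def estimate_monthly_cost(baseline, level):
    """Return (low, high) estimated monthly cost for a resilience level."""
    low_pct, high_pct = OVERHEAD_PCT[level]
    low = baseline * (100 + low_pct) / 100
    high = None if high_pct is None else baseline * (100 + high_pct) / 100
    return low, high


if __name__ == "__main__":
    baseline = 10_000  # hypothetical single-AZ monthly bill
    for level in OVERHEAD_PCT:
        print(level, estimate_monthly_cost(baseline, level))
```

Comparing the result against the revenue lost per hour of downtime is one way to judge whether the next level is worth it.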
Practical Steps You Can Take Today
1. External Monitoring (Essential)
Your cloud provider's monitoring tools run on that same provider. When the provider has an outage, your monitoring and alerting can fail along with it.
Solution: Use an external monitoring service that runs on independent infrastructure:
- Monitor from multiple regions outside your cloud provider
- Alert via channels that don't depend on your cloud (Telegram, SMS, external email)
- Monitor your cloud provider's status page programmatically
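If you want to see what such an external check does under the hood, here is a minimal sketch of a single HTTP probe using only the Python standard library. It assumes you run it from a machine outside your cloud provider; the `probe` function name and the health-check URL are hypothetical, and the `opener` parameter exists only so the function can be exercised without network access.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Fetch url once and return (ok, status_code_or_None, latency_seconds)."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with a 4xx/5xx
    except OSError:
        # URLError and timeouts are OSError subclasses: DNS failure,
        # refused connection, unreachable network, or no answer in time.
        status = None
    latency = time.monotonic() - start
    ok = status is not None and 200 <= status < 400
    return ok, status, latency


if __name__ == "__main__":
    # Hypothetical endpoint; replace with your own health-check URL.
    print(probe("https://yoursite.com/health"))
```

The same function can be pointed at a provider status page. Sending the alert itself should go through a channel that does not share infrastructure with the thing being monitored.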
2. DNS Failover Setup
Use a DNS provider different from your cloud provider:
yoursite.com → Primary: AWS ALB (us-east-1)
             → Failover: GCP LB (europe-west1)
DNS health check: if AWS returns 5xx for 3 checks → switch to GCP
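The health-check rule above (three failed checks, then switch) amounts to a small state machine. In practice your DNS provider's health checks implement this for you; the sketch below only illustrates the logic. The `FailoverDecider` class is hypothetical, and two choices in it are assumptions rather than part of the rule: a timeout (no status code) counts as a failure, and traffic stays on the failover target until someone resets it manually.

```python
class FailoverDecider:
    """Switch to the failover target after N consecutive failed health checks."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.active = "primary"

    def record(self, status_code):
        """Feed one health-check result; return the target that should get traffic."""
        failed = status_code is None or status_code >= 500
        if failed:
            self.consecutive_failures += 1
        else:
            # A single success resets the streak, so isolated errors do not
            # accumulate into a failover.
            self.consecutive_failures = 0
        if self.consecutive_failures >= self.threshold:
            self.active = "failover"
        return self.active

    def reset(self):
        """Manual failback once the primary is confirmed healthy."""
        self.consecutive_failures = 0
        self.active = "primary"


if __name__ == "__main__":
    decider = FailoverDecider(threshold=3)
    for code in [200, 503, 503, 200, 503, 503, 503]:
        print(code, decider.record(code))
```

Requiring consecutive failures avoids flapping on one bad response. Keeping failback manual avoids bouncing traffic back to a primary that is only intermittently healthy.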
3. Static Failover Page
Even without full multi-cloud, you can serve a static "we're experiencing issues" page from a different provider:
- Host a static site on Cloudflare Pages, Netlify, or a different cloud
- Configure DNS failover to switch to the static page when your main app is down
- Include: incident status, expected resolution time, alternative contact methods
4. Backup Data Outside Your Primary Cloud
- Database backups stored in a different cloud or on-premises
- Critical configuration exported and versioned (Terraform state, plus secrets in encrypted form)
- Container images mirrored to a second registry
5. Document Your Cloud Dependencies
Create a dependency map:
| Service | Cloud Service | Region | Multi-AZ? | Failover Plan |
|---|---|---|---|---|
| API servers | EC2 / ECS | us-east-1 | Yes (3 AZ) | Manual failover to us-west-2 |
| Database | RDS PostgreSQL | us-east-1 | Yes (Multi-AZ) | Read replica in us-west-2 |
| File storage | S3 | us-east-1 | Auto (regional) | Cross-region replication |
| DNS | Route 53 | Global | N/A | Secondary: Cloudflare |
| CDN | CloudFront | Global | N/A | Fallback: direct to origin |
| Email | SES | us-east-1 | No | Backup: SendGrid |
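Keeping the map as data rather than only as a table lets you query it for weak spots. The sketch below mirrors the example rows above; the `Dependency` type and both helper functions are hypothetical names, and `multi_az=None` is used here to mean "not applicable" for global services.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dependency:
    service: str
    cloud_service: str
    region: str
    multi_az: Optional[bool]  # None = not applicable (global service)
    failover_plan: str


DEPENDENCIES = [
    Dependency("API servers", "EC2 / ECS", "us-east-1", True, "Manual failover to us-west-2"),
    Dependency("Database", "RDS PostgreSQL", "us-east-1", True, "Read replica in us-west-2"),
    Dependency("File storage", "S3", "us-east-1", True, "Cross-region replication"),
    Dependency("DNS", "Route 53", "Global", None, "Secondary: Cloudflare"),
    Dependency("CDN", "CloudFront", "Global", None, "Fallback: direct to origin"),
    Dependency("Email", "SES", "us-east-1", False, "Backup: SendGrid"),
]


def single_az_risks(deps):
    """Services that would be lost if one availability zone failed."""
    return [d.service for d in deps if d.multi_az is False]


def services_by_region(deps):
    """Group service names by region to show what one regional outage takes out."""
    grouped = defaultdict(list)
    for d in deps:
        grouped[d.region].append(d.service)
    return dict(grouped)


if __name__ == "__main__":
    print("Single-AZ:", single_az_risks(DEPENDENCIES))
    print("By region:", services_by_region(DEPENDENCIES))
```

For this example map, the grouping makes it plain that four of the six services sit in us-east-1, so the failover plans in the last column matter most for that region.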
Detecting Cloud Provider Issues
Monitor the Monitors
Subscribe to your cloud provider's status feeds:
- AWS: https://health.aws.amazon.com
- GCP: https://status.cloud.google.com
- Azure: https://status.azure.com
But don't trust them blindly — providers are often slow to acknowledge issues.
Independent Detection
Your external monitoring service will often detect cloud issues before the provider acknowledges them:
- Increased latency from multiple monitoring regions
- Intermittent failures that don't match your deployment patterns
- Correlated failures across multiple services in the same region
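The third signal, correlated failures in one region, is simple to check mechanically. The sketch below assumes your monitoring can give you the currently failing checks as (service, region) pairs; the `suspect_regions` name and the threshold of two distinct services are illustrative assumptions.

```python
from collections import defaultdict


def suspect_regions(failing_checks, min_services=2):
    """Return regions where at least min_services distinct services are failing.

    failing_checks: iterable of (service, region) tuples for checks that are
    currently failing.
    """
    by_region = defaultdict(set)
    for service, region in failing_checks:
        by_region[region].add(service)  # a set, so repeated alerts count once
    return sorted(r for r, services in by_region.items() if len(services) >= min_services)


if __name__ == "__main__":
    failing = [
        ("api", "us-east-1"),
        ("database", "us-east-1"),
        ("api", "us-east-1"),   # duplicate alert for the same service
        ("cdn", "eu-west-1"),
    ]
    print(suspect_regions(failing))
```

One failing service in a region points to your own deployment; several unrelated services failing in the same region at once is a hint to check the provider before debugging your application.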
Community Signals
- Twitter/X: #AWSdown, #GCPdown, #Azuredown
- Hacker News: outage reports surface within minutes
- DownDetector: crowdsourced outage tracking
During a Cloud Provider Outage
If You Have Failover
- Verify the issue is the cloud provider, not your application
- Trigger failover (automatic or manual depending on your setup)
- Verify failover is serving traffic correctly
- Monitor secondary region closely (it may be under unusual load)
- Communicate to users via status page
If You Don't Have Failover
- Confirm it's a provider issue (check status page, Twitter, community)
- Switch DNS to static failover page if available
- Communicate to users: "Our cloud provider is experiencing issues"
- Monitor provider status for resolution
- Plan improvements for next time
Conclusion
Cloud providers are incredibly reliable — but not infallible. The question isn't whether to trust your cloud provider (you should), but how to design your system so that a provider outage doesn't become a customer outage. Start with external monitoring, add multi-AZ deployment, and invest in multi-region or multi-cloud only when the business justifies the cost.