AWS Outage Impacting Chargebee Services

Incident Report for Chargebee

Postmortem

Incident Overview:

On October 20, 2025, a major outage in the AWS US-EAST-1 region impacted multiple global services, including components critical to our infrastructure. Between 06:46 AM UTC and 09:30 AM UTC, we experienced major failures in our systems primarily due to DNS resolution issues and EC2 instance scaling impairments within AWS. This resulted in degraded performance and limited availability of our services during that window.

Root Cause and Timeline:

06:46 AM – 09:30 AM UTC, 20th October (Major Impact Window)

  • We began observing increased error rates and DNS lookup failures from 06:48 AM UTC. Critical AWS services such as DynamoDB, SQS, STS, EC2, and Lambda were severely degraded.
  • The AWS outage, rooted in a DNS resolution failure affecting EC2 services, prevented us from provisioning new instances and caused some of our application servers to become unhealthy. Because we could not provision new instances, we preserved uptime by shedding load on existing instances, moving low-priority workloads to a separate instance (a simplified sketch of this prioritization follows this list).
  • By 09:30 AM UTC, core services were partially stabilized, though intermittent DNS errors persisted.
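
To illustrate the load-shedding approach described above, here is a minimal Python sketch of priority-based routing. It is an illustration under assumed names only: Priority, route_job, and the pool identifiers are hypothetical, not Chargebee's actual implementation.

from enum import Enum

# Hypothetical priority levels; real workloads would carry richer metadata.
class Priority(Enum):
    CRITICAL = 0
    HIGH = 1
    LOW = 2

# Hypothetical pool names: the primary fleet and the separate instance
# that absorbs low-priority work while autoscaling is unavailable.
PRIMARY_POOL = "app-primary"
OVERFLOW_POOL = "app-low-priority"

def route_job(priority: Priority, primary_overloaded: bool) -> str:
    """Pick a pool for a job while new instances cannot be provisioned.

    Low-priority work is shed to a separate pool so the primary fleet
    keeps serving critical traffic on its fixed capacity.
    """
    if priority is Priority.LOW and primary_overloaded:
        return OVERFLOW_POOL
    return PRIMARY_POOL

if __name__ == "__main__":
    # During the impact window the primary fleet is under pressure:
    print(route_job(Priority.CRITICAL, primary_overloaded=True))  # app-primary
    print(route_job(Priority.LOW, primary_overloaded=True))       # app-low-priority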

Recovery:

After 09:30 AM UTC, 20th October

  • Some application servers became unhealthy, and because autoscaling and new instance provisioning were unavailable, we restarted the affected servers manually.
  • We continued to observe intermittent DNS lookup failures from AWS, averaging about 2,000 errors per hour (a sketch of bounded retries for such transient failures follows this list).
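
Intermittent lookup failures like these are commonly absorbed with bounded retries and backoff. The following is a minimal Python sketch under that assumption, not our production code; resolve_with_retry and its parameters are hypothetical.

import random
import socket
import time

def resolve_with_retry(hostname: str, attempts: int = 4, base_delay: float = 0.2) -> str:
    """Resolve a hostname, retrying transient DNS failures with
    exponential backoff plus jitter so retries do not synchronize."""
    for attempt in range(attempts):
        try:
            return socket.gethostbyname(hostname)
        except socket.gaierror:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the failure
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError("unreachable")

if __name__ == "__main__":
    print(resolve_with_retry("dynamodb.us-east-1.amazonaws.com"))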

Job Rescheduling and Completion:

  • To avoid further load on the system during the AWS recovery phase, we paused the rescheduling of failed jobs until AWS declared their services, especially autoscaling, fully operational.
  • AWS gradually restored autoscaling and EC2 instances by 09:30 PM UTC on 20th October.
  • By 11:00 PM UTC on 20th October, system stability had largely been restored. Some upstream third-party services were still not completely recovered.
  • All critical jobs were rescheduled in a staggered manner to avoid overloading the system (see the sketch after this list) and completed by 02:45 PM UTC on 21st October.
  • All non-critical jobs were then rescheduled, and the entire activity was completed by 02:15 PM UTC on 22nd October.
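
As a rough illustration of the staggered approach, the sketch below resubmits a backlog in small batches with a pause between batches; reschedule_staggered, the batch size, and the pause interval are hypothetical, not our actual tooling.

import time
from typing import Callable, Sequence

def reschedule_staggered(jobs: Sequence[str],
                         submit: Callable[[str], None],
                         batch_size: int = 50,
                         pause_seconds: float = 30.0) -> None:
    """Resubmit failed jobs in small batches, pausing between batches
    so the recovering system is not hit by a thundering herd of retries."""
    for start in range(0, len(jobs), batch_size):
        for job_id in jobs[start:start + batch_size]:
            submit(job_id)
        if start + batch_size < len(jobs):
            time.sleep(pause_seconds)

if __name__ == "__main__":
    # Critical jobs would be ordered first in practice.
    backlog = [f"job-{i}" for i in range(120)]
    reschedule_staggered(backlog, submit=lambda j: print("resubmitted", j),
                         batch_size=50, pause_seconds=0.1)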

We have treated this downtime as a key learning opportunity and established a dedicated internal team to enhance our tooling and processes for faster recovery.

Posted Oct 30, 2025 - 07:47 UTC

Resolved

AWS has implemented the necessary fix and confirmed that all affected services have been fully restored. We have monitored our systems, and they are now operating normally. We appreciate your patience while we worked with our provider to restore full functionality. AWS will share a detailed post-incident summary, and we are marking this incident as resolved.
Posted Oct 21, 2025 - 02:33 UTC

Update

Some issues with dependent AWS services are still being addressed. We continue to monitor our systems closely for any impact.
Posted Oct 20, 2025 - 15:52 UTC

Update

AWS has identified and mitigated the issue affecting multiple dependent services. Most AWS operations have recovered, but some residual latency may persist as backlogs clear. We continue to monitor our systems closely for any impact.
Posted Oct 20, 2025 - 14:07 UTC

Monitoring

Between 07:00 AM and 09:27 AM UTC, we experienced increased error rates and latency across multiple services due to an AWS outage specifically related to a DNS resolution issue impacting DynamoDB.

Our systems have now recovered, and services are operating normally. We will continue to monitor and provide further updates as needed.
Posted Oct 20, 2025 - 11:07 UTC

Update

Latest note from AWS: "We continue to observe recovery across most of the affected AWS Services. We can confirm global services and features that rely on US-EAST-1 have also recovered. We continue to work towards full resolution and will provide updates as we have more information to share."
Posted Oct 20, 2025 - 10:09 UTC

Update

We have an update from AWS, and we will continue to monitor: "We are seeing significant signs of recovery. Most requests should now be succeeding. We continue to work through a backlog of queued requests. We will continue to provide additional information."
Posted Oct 20, 2025 - 09:47 UTC

Identified

We are currently experiencing a service interruption due to an ongoing AWS outage. Our team is working with AWS to restore full access as soon as possible. For more details, see https://health.aws.amazon.com/health/status

We sincerely appreciate your patience and understanding during this time.
Posted Oct 20, 2025 - 09:42 UTC
This incident affected: Chargebee UI (Admin Console (US), Admin Console (EU), Admin Console (AU)), API Endpoints (API (US), API (EU), API (AU)), Checkout (Checkout (US), Checkout (EU), Checkout (AU)), Webhooks (Webhooks (US), Webhooks (EU), Webhooks (AU)), and Reports And Analytics (Dashboard (US), Dashboard (EU), Dashboard (AU)).