Incident Overview:
On October 20, 2025, a major outage in the AWS US-EAST-1 region impacted multiple global services, including components critical to our infrastructure. Between 06:46 AM UTC and 09:30 AM UTC, our systems experienced major failures, caused primarily by DNS resolution issues and EC2 instance scaling impairments within AWS. As a result, our services ran with degraded performance and limited availability during that window.
Root Cause and Timelines:
06:46 AM – 09:30 AM UTC, 20th October (Major Impact Window)
- We began observing increased error rates and DNS lookup failures at 06:48 AM UTC. Critical AWS services such as DynamoDB, SQS, STS, EC2, and Lambda were fully degraded.
- The AWS outage, rooted in a DNS resolution failure affecting EC2 services, prevented us from provisioning new instances and caused some of our application servers to become unhealthy. Because we could not add capacity, we preserved uptime by shedding load from the existing instances: low-priority workloads were moved to a separate instance.
- By 09:30 AM UTC, core services were partially stabilized, though intermittent DNS errors persisted.
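The load-shedding step described above can be sketched roughly as follows. This is an illustrative sketch only, not our production tooling: the `Workload` type, the numeric priority scale, and the `primary` and `overflow` pool names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Workload:
    name: str
    priority: int  # hypothetical scale: lower number means more important


def shed_low_priority(
    workloads: Iterable[Workload], max_primary_priority: int
) -> Dict[str, List[str]]:
    """Split workloads between the existing instances and a separate one.

    Workloads at or below max_primary_priority stay on the existing
    ("primary") instances; everything else moves to the "overflow" instance.
    """
    placement: Dict[str, List[str]] = {"primary": [], "overflow": []}
    for workload in workloads:
        if workload.priority <= max_primary_priority:
            placement["primary"].append(workload.name)
        else:
            placement["overflow"].append(workload.name)
    return placement


if __name__ == "__main__":
    # Hypothetical workload names, used only for illustration.
    sample = [
        Workload("api-requests", 0),
        Workload("report-export", 2),
        Workload("billing-sync", 1),
    ]
    print(shed_low_priority(sample, max_primary_priority=1))
```

The point of the split is that no new capacity is required: the existing instances keep only the work that must stay up, which matters when provisioning is unavailable.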
Recovery:
After 09:30 AM UTC, 20th October
- Some application servers became unhealthy. Because autoscaling and new instance provisioning were unavailable, we restarted some of them.
- We continued to observe intermittent DNS lookup failures from AWS, averaging about 2,000 errors per hour.
Job Rescheduling and Completion:
- To avoid adding load to the system during the AWS recovery phase, we paused rescheduling of failed jobs until AWS declared its services fully operational, especially its autoscaling functionality.
- AWS gradually restored autoscaling and EC2 instance provisioning by 09:30 PM UTC on 20th October.
- By 11:00 PM UTC on 20th October, system stability had largely been restored. Some upstream third-party services were still not completely recovered.
- All critical jobs were rescheduled in a staggered manner to avoid overloading the system, and they completed by 02:45 PM UTC on 21st October.
- All non-critical jobs were rescheduled, and the entire activity was completed by 02:15 PM UTC on 22nd October.
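The rescheduling approach above (hold while the provider is still recovering, then run critical jobs first in staggered groups) could look something like the sketch below. It is a minimal illustration under stated assumptions: `FailedJob`, `plan_batches`, `reschedule`, the batch size, and the pause length are hypothetical names and values, and `provider_healthy` and `submit` stand in for whatever health signal and job queue are actually used.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class FailedJob:
    job_id: str
    critical: bool


def plan_batches(jobs: Sequence[FailedJob], batch_size: int) -> List[List[FailedJob]]:
    """Return batches with every critical batch ahead of every non-critical one."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    critical = [job for job in jobs if job.critical]
    non_critical = [job for job in jobs if not job.critical]
    batches: List[List[FailedJob]] = []
    # Tiers are batched separately so a batch never mixes priorities.
    for tier in (critical, non_critical):
        for start in range(0, len(tier), batch_size):
            batches.append(tier[start:start + batch_size])
    return batches


def reschedule(
    jobs: Sequence[FailedJob],
    batch_size: int,
    provider_healthy: Callable[[], bool],
    submit: Callable[[FailedJob], None],
    pause_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Submit failed jobs in staggered batches; return how many were submitted."""
    if not provider_healthy():
        return 0  # rescheduling stays paused while the provider is recovering
    batches = plan_batches(jobs, batch_size)
    submitted = 0
    for index, batch in enumerate(batches):
        for job in batch:
            submit(job)
            submitted += 1
        if index < len(batches) - 1:
            sleep(pause_seconds)  # the stagger: wait between batches, not after the last
    return submitted
```

Passing `sleep` as a parameter keeps the pause configurable and lets the function be exercised without real delays.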
We have treated this downtime as a key learning opportunity and established a dedicated internal team to enhance our tooling and processes for faster recovery.