Incident Summary:
On 26th June at 05:57 UTC, we experienced a significant surge in traffic targeting one of our customers, including the payment URLs managed by Chargebee. This unexpected increase in requests caused service interruptions because our systems, including the AWS Application Load Balancers (ALBs), could not scale quickly enough to absorb it.
Current Setup:
Our infrastructure is split into multiple clusters to limit the impact radius of both internal changes and external load events. We use AWS Layer 7 (application-layer) load balancing to route traffic to these clusters. We have also configured AWS WAF with rules to block malicious traffic.
Root Cause:
The root cause of the downtime was a targeted attack in which requests from different IP addresses were directed at a customer's website and its payment URLs (on the Chargebee domain). The volume of the attack exceeded the rate at which the system could scale, and the AWS load balancers were affected in particular.
Although AWS WAF was enabled, the magnitude of the attack and the short time over which it arrived degraded the AWS load balancers themselves, so the WAF never got to evaluate the incoming requests. As a result, all clusters were affected even though the underlying instances were healthy.
Steps Taken & Way Forward:
We aim to enhance our system's resilience against similar attacks and ensure uninterrupted service for our customers. We will continue to monitor and adapt our strategies as needed to address evolving security threats.