Service Interruption In US Region

Incident Report for Chargebee

Postmortem

Incident Summary:

On 26th June at 05:57 UTC, we experienced a significant surge in traffic targeting one of our customers, including the payment URLs managed by Chargebee. This unexpected increase in requests led to service interruptions as our systems, including AWS Application Load Balancers (ALBs), were unable to scale adequately.

Current Setup:

Our current infrastructure is designed with multiple clusters to effectively minimize the impact radius during both internal changes and external load events. We utilize AWS's Layer7 functionality to manage overall routing to these clusters. We also have configured WAF with robust rules to block malicious traffic.

Root Cause:

The root cause of the downtime was a targeted attack involving requests from different IPs directed at one of our customer’s websites and their payment URLs (Chargebee domain). The high volume of the attack overwhelmed the system's capacity to scale, particularly affecting our AWS Load Balancers.

Although AWS WAFs were enabled, the magnitude and speed of the attack within a short duration still caused a significant problem in the AWS Load Balancers, preventing the WAF from even evaluating the incoming requests. This resulted in an impact on all the clusters even though the underlying instances were healthy.

Steps Taken & Way Forward:

A tool has been provided to the 24/7 operations team along with SOP to quickly implement the honeypot-based routing mitigation to be done during DDoS attacks.
Evaluating additional DDoS mitigation providers to replace/supplement AWS WAF.
We have planned to enhance our mitigation strategy by having traffic distribution with multiple load balancers in the routing layer.

We aim to enhance our system's resilience against similar attacks and ensure uninterrupted service for our customers. We will continue to monitor and adapt our strategies as needed to address evolving security threats.

Posted Jul 02, 2024 - 07:53 UTC

Resolved

This incident has been resolved.

Posted Jun 26, 2024 - 06:50 UTC

Update

We are continuing to monitor for any further issues.

Posted Jun 26, 2024 - 06:45 UTC

Monitoring

On June 26, 2024, between 05:57 and 06:17 UTC, we experienced brief server timeout issues for customers in the US region. We apologize for any inconvenience this may have caused.

Posted Jun 26, 2024 - 06:37 UTC

This incident affected: Chargebee UI (Admin Console (US)), API Endpoints (API (US)), and Checkout (Checkout (US)).