Incident Summary:
On August 15, 2024, at 3:36 PM AEST, PayFurl deployed a new version of its platform to production. The release included stricter order validation rules that inadvertently disrupted transaction processing for integrations with Chargebee. Affected transactions failed validation, which temporarily prevented payments from being processed.
PayFurl was alerted to the issue by Chargebee on August 16, 2024, at 4:46 AM AEST. A fix was successfully deployed by 6:26 AM AEST on the same day, restoring normal transaction processing.
Services Impacted:
Merchants integrated with Chargebee who use the NMI, Ecentric, or Dlocal gateways experienced disrupted payment services during the incident. Merchants not using these gateways were unaffected.
Root Cause:
The incident was caused by a new validation rule introduced in PayFurl’s latest release. This rule required that order items be present whenever an order number was provided, in line with gateway validation standards. However, the stricter validation conflicted with Chargebee’s existing logic, which allowed an order number to be populated without associated order items. As a result, transactions originating from Chargebee failed validation and returned 4xx errors during payment processing.
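The conflict can be illustrated with a minimal sketch. This is not PayFurl's actual code: the `Order` shape, the `validate_order` function, the `strict` flag, and the specific 400 status are all hypothetical and chosen only to show how a request with an order number but no order items is rejected under the strict rule and accepted once the rule is relaxed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Order:
    # Hypothetical payload shape; field names are illustrative only.
    order_number: Optional[str] = None
    items: List[dict] = field(default_factory=list)


def validate_order(order: Order, strict: bool = True) -> int:
    """Return an HTTP-style status: 200 if accepted, 400 if rejected."""
    # Strict rule: an order number must come with at least one order item.
    if strict and order.order_number and not order.items:
        return 400
    return 200


# A request shaped like the ones described above:
# order number populated, no order items attached.
request = Order(order_number="ORD-1001")

print(validate_order(request, strict=True))   # 400
print(validate_order(request, strict=False))  # 200
```

Because the rejection is a client-side (4xx) response rather than a server fault, a check of this kind would not show up in monitoring that only watches 5xx errors.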
Although our monitoring systems effectively tracked server-side 5xx errors, no alerts were configured for 4xx errors, which delayed detection and resolution of the issue.
Remediations & Follow-Up Steps:
A fix was deployed on August 16, 2024, at 6:26 AM AEST, relaxing the validation rule to allow transactions to proceed even when only the order number is populated.
An automated test has been added to the PayFurl platform to specifically cover the interaction between PayFurl’s validation rules and Chargebee’s integration. This is intended to catch similar issues before future updates reach production.
Monitoring has been expanded to include 4xx errors, with alerts set to notify the team if no successful transactions occur within an hour, enabling quicker identification and resolution of similar incidents.
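As a rough illustration of the alert condition described above, the following sketch checks whether any successful transaction was recorded in the trailing hour. It is an assumption-laden example, not PayFurl's monitoring configuration: the `should_alert` function, the `(timestamp, status)` event format, and the sample data are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def should_alert(
    events: Iterable[Tuple[datetime, int]],
    now: datetime,
    window: timedelta = timedelta(hours=1),
) -> bool:
    """Return True when no 2xx response falls inside the trailing window."""
    cutoff = now - window
    return not any(
        cutoff <= ts <= now and 200 <= status < 300
        for ts, status in events
    )


# Hypothetical sample: the last success is 90 minutes old,
# and everything since then has been a 4xx response.
now = datetime(2024, 8, 16, 4, 46)
events = [
    (now - timedelta(minutes=90), 200),
    (now - timedelta(minutes=30), 400),
    (now - timedelta(minutes=5), 422),
]

print(should_alert(events, now))  # True
```

Keying the alert on the absence of successes, rather than on a raw 4xx count, means it fires regardless of which status code the failures return.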