During our deployment, one of the task to perform was to remove a unique constraint in a table whose column we are deprecating. The column we are removing also serves as an index for this used table. When we decided to remove the index, we failed to measure the impact of the queries on this table; this led us to underestimate the impact of the index removal. Right after we removed the index, the database CPU hit 100% causing the requests to pile up and timeout.
Though we were ready for the rollback plan, executing the query to reinstate the index requires a database meta-lock, which was unavailable due to the queries running. Instead of bringing all the services down, we stopped all the back end jobs, but still, the web requests kept hitting this table. By now, we were sure that only way to fix the problem is by bringing down all the servers. We brought the entire service down, added the index back.
The incident from the first error to normal operation took 19 minutes (05:07 UTC to 05:26 UTC). The entire app was 100% down for 9 minutes from 05:17 UTC to 05:26 UTC.