20:30 UTC: we received user reports that the CreateRoom API was failing.
Given the proximity to the incident in US East 1, we expected the two events to be related. The first part of our investigation focused on any remaining impact from US East 1, which was not actively serving traffic at the time.
21:00 UTC: we confirmed the issues were happening even for users who would not normally connect to US East 1.
21:11 UTC: we noticed that the volume of both CreateRoom calls and DB queries was roughly 10x typical levels, and began working to reduce the number of queries.
21:25 UTC: we determined that the bottleneck was our database: the entire cluster had been pushed to 100% utilization and was therefore fulfilling queries slowly. We also realized that while CreateRoom API calls timed out, the rooms had in fact been created correctly.
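The combination of timed-out responses and successfully created rooms is the classic case for idempotency keys: a retried call should return the room the first attempt already created, rather than creating another one. Below is a minimal sketch of that pattern; the function and its in-memory stores are hypothetical stand-ins for the real service and tables, not our actual implementation.

```python
import uuid

# Hypothetical in-memory stores standing in for real DB tables.
_idempotency_store: dict[str, str] = {}  # idempotency key -> room id
_rooms: dict[str, dict] = {}

def create_room(name: str, idempotency_key: str) -> str:
    """Create a room, or return the existing one if this request was
    already processed (e.g. a retry after a client-side timeout)."""
    if idempotency_key in _idempotency_store:
        # The earlier attempt succeeded server-side even though the
        # caller saw a timeout; return the same room rather than
        # creating a duplicate.
        return _idempotency_store[idempotency_key]

    room_id = str(uuid.uuid4())
    _rooms[room_id] = {"name": name}
    _idempotency_store[idempotency_key] = room_id
    return room_id
```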
21:30 UTC: we began mitigating the CPU pressure on our database through a combination of rate limiting and upgraded DB compute capacity.
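For background, endpoint-level rate limiting of this kind is often implemented as a token bucket in front of the hot path. The sketch below illustrates the mechanism only; it is not our production limiter, and the class name and parameters are illustrative.

```python
import time

class TokenBucket:
    """Admits at most `rate` requests per second on average,
    with bursts of up to `capacity` requests."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # Caller sheds the request (e.g. returns HTTP 429).
```

Requests rejected by the limiter never reach the database, which is what gives a saturated cluster room to catch up.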
22:00 UTC: DB load started to come down, and the 500 error rate for CreateRoom began to drop.
22:05 UTC: CreateRoom was fully functional and the error rate had dropped to a minimal level. We reverted the rate limiting measures that had been put in place.
The incident originated from a sudden failover at the US East 1 data center, which led to an unexpected surge of traffic to US West. Each data center maintains its own in-memory cache of active rooms. As users were redirected from US East 1 to US West, the latter's cache did not contain records of those rooms, necessitating on-the-fly database queries to load the required information.
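In cache-aside terms, the per-data-center lookup behaves roughly like the sketch below (`get_room`, `db_load_room`, and the cache shape are hypothetical stand-ins for the real service). The important property is that every cache miss becomes a database query, so a wave of redirected users translates directly into DB load.

```python
_room_cache: dict[str, dict] = {}  # per-data-center in-memory cache

def db_load_room(room_id: str) -> dict:
    """Stand-in for the real database lookup."""
    return {"id": room_id, "active": True}

def get_room(room_id: str) -> dict:
    room = _room_cache.get(room_id)
    if room is not None:
        return room  # Hot path: served entirely from memory.

    # Cold path: after a failover, redirected users miss the local
    # cache, and each miss turns into a database query.
    room = db_load_room(room_id)
    _room_cache[room_id] = room
    return room
```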
This abrupt increase in database (DB) queries significantly elevated DB utilization, slowing query execution. As a result, some queries exceeded the API's maximum wait time, triggering retries both within our internal codebase and from client applications. Those retries escalated into a feedback loop of CreateRoom requests, overwhelming the DB to the point where it began to time out on most queries, effectively causing a cascading failure.
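A standard defense against this kind of retry amplification is exponential backoff with jitter and a hard cap on attempts. The sketch below shows the general technique; our actual client and internal retry logic differ.

```python
import random
import time

def call_with_backoff(fn, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry `fn` on timeout with capped, jittered exponential backoff.
    Limiting attempts and randomizing delays keeps retries from
    synchronizing into a storm against an already-saturated database."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts; surface the failure.
            # Full jitter: sleep a random amount up to the exponential cap.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

Combined with server-side rate limiting, capped and jittered retries break feedback loops like this one instead of feeding them.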
Real-time sessions do not rely on the database, so they continued to function without issue. Other APIs were likewise unaffected.