20:30 UTC: we received user reports that the CreateRoom API was failing.
Given the proximity to the incident in US East 1, we expected the two events to be related. The first part of our investigation focused on any remaining impact from US East 1, which was not actively serving traffic at the time.
21:00 UTC: we confirmed the issues were happening even for users who would not normally connect to US East 1.
21:11 UTC: we noticed that the volume of both CreateRoom calls and DB queries was roughly 10x typical levels, and began working to reduce the number of queries.
21:25 UTC: we determined that the bottleneck was our database: the entire cluster had been pushed to 100% utilization and was therefore fulfilling queries slowly. We also realized that while CreateRoom API calls timed out, the rooms had in fact been created correctly.
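The combination of timed-out responses and successfully created rooms is the classic case for idempotency keys: a retried call should return the room the first attempt already created, rather than creating another one. Below is a minimal sketch of that pattern; the function and its in-memory stores are hypothetical stand-ins for the real service and tables, not our actual implementation.

```python
import uuid

# Hypothetical in-memory stores standing in for real DB tables.
_idempotency_store: dict[str, str] = {}  # idempotency key -> room id
_rooms: dict[str, dict] = {}

def create_room(name: str, idempotency_key: str) -> str:
    """Create a room, or return the existing one if this request was
    already processed (e.g. a retry after a client-side timeout)."""
    if idempotency_key in _idempotency_store:
        # The earlier attempt succeeded server-side even though the
        # caller saw a timeout; return the same room rather than
        # creating a duplicate.
        return _idempotency_store[idempotency_key]

    room_id = str(uuid.uuid4())
    _rooms[room_id] = {"name": name}
    _idempotency_store[idempotency_key] = room_id
    return room_id
```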
21:30 UTC: we began mitigating the CPU pressure on our database through a combination of rate limiting and upgraded DB compute capacity.
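For background, endpoint-level rate limiting of this kind is often implemented as a token bucket in front of the hot path. The sketch below illustrates the mechanism only; it is not our production limiter, and the class name and parameters are illustrative.

```python
import time

class TokenBucket:
    """Admits at most `rate` requests per second on average,
    with bursts of up to `capacity` requests."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # Caller sheds the request (e.g. returns HTTP 429).
```

Requests rejected by the limiter never reach the database, which is what gives a saturated cluster room to catch up.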
22:00 UTC: DB load started to come down, and the 500 error rate for CreateRoom began to drop.
22:05 UTC: CreateRoom was fully functional and the error rate had dropped to a minimal level. We reverted the rate limiting measures that had been put in place.
The incident originated from a sudden failover at the US East 1 data center, which led to an unexpected surge of traffic to US West. Each data center maintains its own in-memory cache of active rooms. As users were redirected from US East 1 to US West, the latter's cache did not contain records of those rooms, necessitating on-the-fly database queries to load the required information.
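In cache-aside terms, the per-data-center lookup behaves roughly like the sketch below (`get_room`, `db_load_room`, and the cache shape are hypothetical stand-ins for the real service). The important property is that every cache miss becomes a database query, so a wave of redirected users translates directly into DB load.

```python
_room_cache: dict[str, dict] = {}  # per-data-center in-memory cache

def db_load_room(room_id: str) -> dict:
    """Stand-in for the real database lookup."""
    return {"id": room_id, "active": True}

def get_room(room_id: str) -> dict:
    room = _room_cache.get(room_id)
    if room is not None:
        return room  # Hot path: served entirely from memory.

    # Cold path: after a failover, redirected users miss the local
    # cache, and each miss turns into a database query.
    room = db_load_room(room_id)
    _room_cache[room_id] = room
    return room
```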
This abrupt increase in database (DB) queries significantly elevated DB utilization, slowing query execution. As a result, some queries exceeded the API's maximum wait time, triggering retries both within our internal codebase and from client applications. Those retries escalated into a feedback loop of CreateRoom requests, overwhelming the DB to the point where it began to time out on most queries, effectively causing a cascading failure.
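A standard defense against this kind of retry amplification is exponential backoff with jitter and a hard cap on attempts. The sketch below shows the general technique; our actual client and internal retry logic differ.

```python
import random
import time

def call_with_backoff(fn, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry `fn` on timeout with capped, jittered exponential backoff.
    Limiting attempts and randomizing delays keeps retries from
    synchronizing into a storm against an already-saturated database."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts; surface the failure.
            # Full jitter: sleep a random amount up to the exponential cap.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

Combined with server-side rate limiting, capped and jittered retries break feedback loops like this one instead of feeding them.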
Real-time sessions do not rely on the database, so they continued to function without issue. Other APIs were likewise unaffected.