High CreateRoom failure rate in US East 1

Incident Report for LiveKit

Postmortem

Incident summary

Our monitoring system detected an elevated failure rate in CreateRoom API requests, but the rate did not reach the threshold for a P1 or P2 incident. The API failure rate alarm was configured on the aggregate rate across all data centers rather than on each data center individually, so failures concentrated in a single data center were diluted by healthy traffic elsewhere.
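To illustrate how an aggregate alarm can miss a regional failure, here is a minimal sketch. The data center names, request counts, and the 5% threshold are all made-up assumptions for illustration, not figures from this incident or from our actual monitoring configuration.

```python
from collections import defaultdict


def failure_rates(samples):
    """Return (aggregate_rate, per_dc_rates) from (data_center, succeeded) pairs."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for data_center, succeeded in samples:
        totals[data_center] += 1
        if not succeeded:
            failures[data_center] += 1
    per_dc = {dc: failures[dc] / totals[dc] for dc in totals}
    aggregate = sum(failures.values()) / sum(totals.values())
    return aggregate, per_dc


def breaching(per_dc, threshold):
    """List the data centers whose failure rate meets or exceeds the threshold."""
    return sorted(dc for dc, rate in per_dc.items() if rate >= threshold)


# Hypothetical traffic: one data center fails 40% of requests, two are healthy.
samples = (
    [("us-east-1", False)] * 40
    + [("us-east-1", True)] * 60
    + [("germany", True)] * 450
    + [("other", True)] * 450
)

THRESHOLD = 0.05  # assumed alert threshold, for illustration only

aggregate, per_dc = failure_rates(samples)
aggregate_alarm = aggregate >= THRESHOLD
per_dc_alarm = breaching(per_dc, THRESHOLD)
```

With these assumed numbers the aggregate rate stays below the threshold while one data center is far above it, which is the pattern a per-data-center alarm is meant to catch.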

Further investigation revealed that, following an earlier connectivity issue between the US East 1 and Germany data centers, one of the components involved in room creation was not able to synchronize data quickly enough to keep up with the rate of requests.

It's important to note that real-time traffic and other API calls remained unaffected.

Root cause

The root cause was identified as a failure in JetStream, our distributed message store used to ensure consistent room creation across the global mesh network.

Earlier this year, we had concluded that JetStream did not have the fault tolerance profile suited to a real-time, globally distributed service. Although we’ve removed JetStream from our real-time stack, room creation via API still had JetStream in its critical path. A replacement system is under development but is not yet operational.

The issue was resolved by resetting the queue and connections between JetStream instances.

Remediation plan

To prevent similar incidents, we have implemented the following measures:

Updated our API failure rate monitor to target each data center individually.
Implemented a temporary workaround to bypass JetStream failures in the CreateRoom function.

In addition, we will accelerate the complete removal of JetStream from API usage.
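As a rough illustration of what bypassing a failing dependency can look like, here is a minimal sketch. It is not the actual workaround: `SyncStoreError`, `create_room`, and the callables passed to it are hypothetical names, and the sketch assumes that proceeding without the store is acceptable for a short period.

```python
class SyncStoreError(Exception):
    """Hypothetical error raised when the message store is unreachable or too slow."""


def create_room(name, publish_to_store, create_locally):
    """Create a room, tolerating a failure of the (hypothetical) store step.

    publish_to_store and create_locally are caller-supplied callables; both
    are assumptions made for this sketch.
    """
    try:
        publish_to_store(name)
        synced = True
    except SyncStoreError:
        # Bypass: continue without the store and flag the result so the
        # caller can log or reconcile it later.
        synced = False
    room = create_locally(name)
    return {"room": room, "synced": synced}


def failing_store(name):
    raise SyncStoreError("store unavailable")


def healthy_store(name):
    return None


def local_create(name):
    return f"room:{name}"


degraded = create_room("demo", failing_store, local_create)
normal = create_room("demo", healthy_store, local_create)
```

The trade-off in a fallback of this kind is that requests keep succeeding during a store outage, but the cross-region consistency the store normally provides is weakened while the bypass is active, which is why it is only a temporary measure.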

Posted Nov 16, 2023 - 01:20 PST

Resolved

User reports indicated CreateRoom failures beginning at approximately 12:30 UTC, specifically impacting the US East 1 region. After thorough investigation, the team has identified and rectified the failing component, fully resolving the issue by 14:30 UTC.

Posted Nov 15, 2023 - 04:30 PST