Our monitoring system detected an elevated failure rate in CreateRoom API requests, but the rate did not reach the threshold for a P1 or P2 incident. The API failure rate alarm was configured on the aggregate rate across all data centers rather than on each data center individually, so failures concentrated in one region were diluted by healthy traffic elsewhere.
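To illustrate why an aggregate alarm can stay silent while a single data center is failing, here is a minimal sketch. The data center labels, request counts, and the 5% threshold are all hypothetical values chosen for illustration. They are not figures from this incident and do not reflect our actual alerting configuration.

```python
from dataclasses import dataclass


@dataclass
class DcStats:
    """Request and failure counts for one data center over an alarm window."""
    requests: int
    failures: int


def failure_rate(stats: DcStats) -> float:
    """Failure rate for a single data center (0.0 when there is no traffic)."""
    return stats.failures / stats.requests if stats.requests else 0.0


def aggregate_rate(per_dc: dict[str, DcStats]) -> float:
    """Failure rate computed over all data centers combined."""
    total_requests = sum(s.requests for s in per_dc.values())
    total_failures = sum(s.failures for s in per_dc.values())
    return total_failures / total_requests if total_requests else 0.0


def breaching_dcs(per_dc: dict[str, DcStats], threshold: float) -> list[str]:
    """Data centers whose own failure rate exceeds the threshold."""
    return sorted(dc for dc, s in per_dc.items() if failure_rate(s) > threshold)


# Hypothetical threshold and traffic numbers, for illustration only.
THRESHOLD = 0.05
sample = {
    "dc-a": DcStats(requests=10_000, failures=50),
    "dc-b": DcStats(requests=1_000, failures=400),
    "dc-c": DcStats(requests=9_000, failures=45),
}

# An aggregate alarm compares one combined rate against the threshold.
aggregate_alarm_fires = aggregate_rate(sample) > THRESHOLD

# A per-data-center alarm evaluates each data center separately.
per_dc_alarms = breaching_dcs(sample, THRESHOLD)
```

With these sample numbers the combined rate is 495 failures out of 20,000 requests, which is below the threshold, so the aggregate alarm does not fire even though `dc-b` is failing 40% of its requests. The per-data-center check flags `dc-b` on its own.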
Further investigation revealed that, because of an earlier connectivity issue between the US East 1 and Germany DCs, one of the components involved in room creation could not synchronize data quickly enough to keep up with the rate of requests.
It's important to note that real-time traffic and other API calls remained unaffected.
The root cause was identified as a failure in JetStream, our distributed message store used to ensure consistent room creation across the global mesh network.
We concluded earlier this year that JetStream did not have the fault tolerance profile suited to a real-time, globally distributed service. Although we’ve removed JetStream from our real-time stack, room creation via API still had JetStream in its critical path. A replacement system is under development but is not yet operational.
The issue was resolved by resetting the queue and connections between JetStream instances.
To prevent similar incidents, we have implemented the following measures: