We had two outages related to JetStream, our message bus/persistence solution.
On Sunday, March 26th at 9:51PM PT, our monitoring systems picked up errors relating to JetStream in the Singapore region. RTC media servers began shutting themselves down because they could not read from or write to JetStream. We started disabling the region in order to route traffic elsewhere.
At 10:01PM (T+0mins), the errors had spread to other regions, causing a global outage. After identifying JetStream as the source of the issue, we began restarting NATS globally, which should have re-established its raft groups and brought it back online.
10:09PM (T+8mins), we completed the restart, only to discover that while NATS itself was back up, JetStream remained down. Judging from the logs, leader election was not happening.
10:21PM (T+20mins), attempts to force leader election and troubleshoot the raft groups were unsuccessful. We decided to upgrade to a version of NATS that we had been validating in staging, since it contained relevant bug fixes.
10:30PM (T+29mins), after upgrading NATS, JetStream came back online. Media servers started back up, and we were operational again.
On Monday, March 27th at 10:20PM PT, our monitoring systems picked up warnings: JetStream was emitting a stream of "consumer assignment not cleaned up" messages. In the past, we've seen similar errors precede JetStream issues. While monitoring stream health, we saw that the Singapore region was falling behind on replication. At this point, all regions remained operational, but we were concerned about the increased volume of warnings.
At 10:29PM, Singapore was not able to communicate with JetStream, causing some connections to fail. We decided to disable Singapore in order to route traffic to other regions.
10:41PM (T+0mins), JetStream in other regions started to fail, causing another global outage. Traffic charts showed that, just prior to the JetStream failure, the number of internal messages it was processing had escalated by 10x. We suspected overly-eager internal retries, which amplified the problem.
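Retry storms like this are commonly mitigated with capped exponential backoff plus jitter, so that clients spread their retries out instead of hammering a recovering bus in lockstep. A minimal sketch of the pattern (this is illustrative, not LiveKit's actual retry code; the function name and defaults are assumptions):

```python
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    """Return a retry delay using capped exponential backoff with full jitter.

    The uncapped delay doubles with each attempt (base * 2**attempt) and is
    clamped to `cap`; a uniformly random value in [0, clamped delay] is
    returned so that many clients retrying at once do not synchronize.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Without the jitter term, every client that failed at the same instant would retry at the same instant, reproducing the original spike on each attempt.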
10:43PM (T+2mins), we decided to repeat the earlier procedure: shut down NATS globally, then bring it back up to re-sync raft state.
10:55PM (T+14mins), the NATS restart was complete and services started coming back online.
11:00PM (T+19mins), services were fully operational.
11:39PM (T+58mins), we started observing an increased error rate, with JetStream again the culprit. Some sessions failed or required client-side retries in order to connect.
After experiencing a prior global outage in February, we had been working to remove JetStream from our infrastructure. One of our key design goals with LiveKit Cloud was to have full isolation between regions: a failure in one data center must not impact operations in other data centers. It became clear to us earlier this year that JetStream could not meet that design goal.
Since then, we had been working on a distributed data synchronization solution called SyncStore. It had been running in our pre-production environment with select customers for the previous two weeks. Our original plan was to first release it to production in "shadow mode", where it performs synchronization in the background to ensure data validity while still deferring to JetStream as the primary synchronizer.
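Shadow mode can be thought of as a dual-write wrapper: every write goes to both stores, reads are always answered by the primary, and any divergence observed in the shadow is recorded rather than surfaced to callers. A rough sketch of that pattern, assuming dict-like stores (the class and callback names are hypothetical, not SyncStore's actual interface):

```python
class ShadowedStore:
    """Writes go to both stores; reads defer to the primary.

    The shadow store (e.g. a new synchronizer under validation) is
    best-effort: its failures and mismatches are recorded, never
    propagated, so the primary remains the source of truth.
    """

    def __init__(self, primary, shadow, on_mismatch):
        self.primary = primary
        self.shadow = shadow
        self.on_mismatch = on_mismatch  # called with each divergent key

    def put(self, key, value):
        self.primary[key] = value
        try:
            self.shadow[key] = value  # best-effort shadow write
        except Exception:
            pass  # a shadow failure must not affect the caller

    def get(self, key):
        value = self.primary.get(key)
        if self.shadow.get(key) != value:
            self.on_mismatch(key)  # record divergence for offline analysis
        return value  # always answer from the primary
```

The appeal of this rollout strategy is that the new system sees production traffic and its correctness is continuously checked, while a bug in it cannot affect users.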
However, after two outages in a row, we decided to deploy it immediately in order to reduce the likelihood of additional failures.
At 11:55PM, we began the upgrade to SyncStore. The upgrade was completed by 12:30AM.
As of today, our RTC services do not rely on JetStream to function. User connections will continue to work even when JetStream is completely down. In the next few months, we'll be moving other services (such as analytics) off of JetStream as well.
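Decoupling of this kind generally means treating the message bus as optional on the connection path: publishes become best-effort, so a bus outage degrades ancillary features (like analytics delivery) instead of failing sessions. A simplified illustration of the pattern (hypothetical, not the actual LiveKit code):

```python
class BestEffortBus:
    """Wrap a publish function so that bus failures never fail the caller."""

    def __init__(self, publish):
        self._publish = publish
        self.dropped = 0  # messages lost while the bus was down

    def try_publish(self, msg) -> bool:
        try:
            self._publish(msg)
            return True
        except Exception:
            self.dropped += 1  # count the drop; the session carries on
            return False
```

The drop counter gives operators visibility into how much data was lost during an outage without putting the bus back on the critical path.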