We experienced another failure in our global message bus today. The failure started at 11:50 PT, with services fully restored by 12:13 PT. This is the second time this has occurred in the last week.
One of our primary design goals for LiveKit Cloud is to share nothing between data centers, so that the service remains operational even when entire data centers become unavailable.
We’ve discovered in recent weeks that the message bus component we use (NATS JetStream) does not provide the isolation that we thought it did. Today’s outage was caused by a hardware failure on a single NATS instance. That failure appeared to render the entire JetStream system unusable until all instances were restarted together. This appears to be a bug in the version of NATS that we are using.
While we believe the bug will be resolved by the NATS team, it’s clear that relying on JetStream does not give us the isolation guarantees that we need for LiveKit Cloud. Over the next month, we will be making some key architecture changes to remove that reliance. It is imperative to the LiveKit team that Cloud provides reliable infrastructure that can withstand all kinds of failures, hardware or software, without disrupting real-time services.