Connectivity Issues
Incident Report for LiveKit
Postmortem

We experienced another failure in our global message bus today. The failure started at 11:50 PT, with services fully restored by 12:13 PT. This is the second time this has occurred in the last week.

One of our primary design goals for LiveKit Cloud is to share nothing between data centers, so that the service remains operational even when entire data centers become unavailable.

We’ve discovered in recent weeks that the message bus component we use (NATS JetStream) does not provide the isolation that we thought it did. Today’s outage was caused by a hardware failure on a single NATS instance. That failure appeared to render the entire JetStream system unusable until all instances were restarted together. This appears to be a bug in the version of NATS that we are using.

While we believe the bug will be resolved by the NATS team, it’s clear that relying on JetStream does not give us the isolation guarantees that we need for LiveKit Cloud. Over the next month, we will be making some key architecture changes that remove that reliance. It is imperative to the LiveKit team that Cloud provides reliable infrastructure that can withstand all kinds of failures, hardware or software, without disrupting real-time services.

Posted Feb 13, 2023 - 15:29 PST

Resolved
This incident has been resolved.
Posted Feb 13, 2023 - 15:04 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 13, 2023 - 12:13 PST
Identified
The issue has been identified as a message bus outage.
Posted Feb 13, 2023 - 12:06 PST
Investigating
We are currently investigating this issue.
Posted Feb 13, 2023 - 12:00 PST
This incident affected: Global Real Time Communication.