Intermittent connectivity disruptions in US regions

Incident Report for LiveKit

Postmortem

Summary

LiveKit Cloud runs as a distributed realtime network, with data centers around the world interconnected via dedicated networking. Even with dedicated fiber, momentary disruptions between any two data centers can and do occur. To handle these blips, we have built a comprehensive set of resilience mechanisms that automatically reroute and relay traffic over alternate healthy paths. For example, if data centers A and B cannot reach one another cleanly but both can reach data center C, we use C as a relay so that traffic flows A to C to B. Under normal circumstances, network blips between our data centers are handled transparently there would be no visible impact.

Between 2026-05-07 and 2026-05-19, a small number of these otherwise routine network blips did become customer-visible in our US regions. Affected sessions experienced a brief interruption to media (under 5 minutes) before recovering on a new path.

Impact

Five short windows of disruption were observed, each tied to a brief network blip between certain clusters:

  • 2026-05-07 17:16 UTC
  • 2026-05-12 05:00 UTC
  • 2026-05-12 17:00 UTC
  • 2026-05-13 19:00 UTC
  • 2026-05-15 07:30 UTC
  • 2026-05-19 15:01 UTC

The majority of sessions traversed alternate paths normally and were not affected. A subset of sessions whose traffic happened to be relayed through a region in the specific failure state described below experienced a media interruption of up to ~5 minutes before re-routing onto a healthy path.

Root Cause

On 2026-05-06, a change was deployed that altered the relay process. The change introduced a subtle bug that required two conditions to occur simultaneously to manifest:

  1. The relay region was under-utilized and had not cached a particular piece of state needed by the relay process.
  2. A large volume of traffic was directed to that relay region in a very short window of time.

When both conditions were present, the relay process would be stuck and would take minutes to fully catch up. During that time, neither endpoint of the relayed session could continue to receive media from the other.

Because both conditions are narrow (a cold-cache relay region absorbing a sudden burst of traffic), the bug did not surface during pre-deploy testing, and it did not trigger on every network blip. It only manifested when a real network disruption happened to redirect a sufficiently large burst of traffic to a relay region that had not warmed its cache. Once that occurred on a given relay, sessions flowing through it stalled until traffic shifted off that path.

We tracked down the root cause on 2026-05-20, and a fix has been fully rolled out across the fleet.

Corrective Actions & Prevention

  • Improved integration tests for relay locks under load. We are adding tests that exercise the relay process with cold caches and sudden traffic bursts, so this class of contention is caught before deploy. We are also auditing related lock paths to ensure they behave correctly under similar load profiles.

We sincerely apologize for the disruption to customers whose sessions were affected during these windows. Thank you for your patience, and we welcome any additional feedback from customers who were impacted.

Posted May 22, 2026 - 00:39 PDT

Resolved

We are resolving this incident as we have not observed any further connectivity disruptions. We will follow up with a postmortem once available.
Posted May 15, 2026 - 19:50 PDT

Monitoring

We have not observed further instances of connectivity disruptions since 13:19 UTC. We are continuing to investigate why our reroute mechanism did not activate in these isolated instances and are monitoring for further disruptions.
Posted May 15, 2026 - 14:16 PDT

Investigating

We are investigating intermittent connectivity disruptions affecting a subset of sessions, caused by internal network degradation between several of our US clusters (most heavily impacting US East). Affected sessions may see brief interruptions (less than 5 minutes) to media relay. We have observed these interruptions at the following times:
- 5/12/26 05:00 UTC
- 5/12/26 17:00 UTC
- 5/13/26 19:00 UTC
- 5/15/26 07:30 UTC

We are actively investigating. Further updates to follow.
Posted May 15, 2026 - 09:57 PDT