LiveKit Cloud runs as a distributed realtime network, with data centers around the world interconnected via dedicated networking. Even with dedicated fiber, momentary disruptions between any two data centers can and do occur. To handle these blips, we have built a comprehensive set of resilience mechanisms that automatically reroute and relay traffic over alternate healthy paths. For example, if data centers A and B cannot reach one another cleanly but both can reach data center C, we use C as a relay so that traffic flows A to C to B. Under normal circumstances, network blips between our data centers are handled transparently, with no visible impact.
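The relay example above can be sketched in a few lines. This is only an illustration of the general idea, not LiveKit's actual routing code: the `Health` map, `link_ok`, and `pick_path` are hypothetical names, and the sketch assumes link health is already known and that a single relay hop is enough.

```python
from typing import Dict, FrozenSet, List, Optional

# Hypothetical health map: each unordered pair of data centers maps to
# True when the direct link between them is currently healthy.
Health = Dict[FrozenSet[str], bool]


def link_ok(health: Health, a: str, b: str) -> bool:
    """Return True if the direct link a<->b is known to be healthy."""
    return health.get(frozenset((a, b)), False)


def pick_path(health: Health, src: str, dst: str,
              regions: List[str]) -> Optional[List[str]]:
    """Prefer the direct path; otherwise use the first region that can
    reach both endpoints as a relay. Returns None if no path exists."""
    if link_ok(health, src, dst):
        return [src, dst]
    for relay in regions:
        if relay in (src, dst):
            continue
        if link_ok(health, src, relay) and link_ok(health, relay, dst):
            return [src, relay, dst]
    return None


# A and B cannot reach each other, but both can reach C.
health: Health = {
    frozenset(("A", "B")): False,
    frozenset(("A", "C")): True,
    frozenset(("B", "C")): True,
}
print(pick_path(health, "A", "B", ["A", "B", "C"]))  # ['A', 'C', 'B']
```

A production system would also weigh latency, capacity, and relay readiness when choosing among candidate relays; this sketch simply takes the first healthy one.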
Between 2026-05-07 and 2026-05-19, a small number of these otherwise routine network blips did become customer-visible in our US regions. Affected sessions experienced a brief interruption to media (under 5 minutes) before recovering on a new path.
Five short windows of disruption were observed, each tied to a brief network blip between certain clusters:
The majority of sessions traversed alternate paths normally and were not affected. A subset of sessions, those whose traffic happened to be relayed through a region in the specific failure state described below, experienced a media interruption of up to ~5 minutes before rerouting onto a healthy path.
On 2026-05-06, a change was deployed that altered the relay process. The change introduced a subtle bug that required two conditions to occur simultaneously to manifest:
When both conditions were present, the relay process would stall and take minutes to fully catch up. During that time, neither endpoint of a relayed session could receive media from the other.
Because both conditions are narrow (a cold-cache relay region absorbing a sudden burst of traffic), the bug did not surface during pre-deploy testing, and it did not trigger on every network blip. It only manifested when a real network disruption happened to redirect a sufficiently large burst of traffic to a relay region that had not warmed its cache. Once that occurred on a given relay, sessions flowing through it stalled until traffic shifted off that path.
We tracked down the root cause on 2026-05-20, and a fix has been fully rolled out across the fleet.
We sincerely apologize for the disruption to customers whose sessions were affected during these windows. Thank you for your patience, and we welcome any additional feedback from customers who were impacted.