LiveKit Cloud is a globally distributed system. Our data centers continuously probe one another, and our overlay network uses those probes to choose healthy paths between regions. When a brief network disruption occurs in or between data centers, the overlay is designed to mark affected paths as unhealthy, route traffic around them, and return them to "healthy" once the underlying network has cleared.
On June 18, 2026 (US East 1) and again on June 22, 2026 (US Central), a short packet-loss event in the affected data center triggered a compound failure in this mechanism. Two separate defects, one in the overlay network and one in our link-monitoring service, interacted in a way that left the affected region effectively isolated from other regions long after the underlying network had cleared.
For most customers and most products, the impact was bounded. Our global routing automatically moved traffic to other US regions within approximately one minute of onset. SIP transfers anchored in the affected region took longer to recover but cleared within approximately thirty minutes.
A small number of LiveKit Agents running self-hosted workers experienced an extended outage in agent dispatch, up to about two hours in some cases. The majority of agent workloads continued to receive jobs normally after the routing flip. The workaround at the time was to restart the affected agent worker process so that it re-registered against a controller in a healthy region. The extended dispatch outage was caused by a separate defect in our agent dispatch routing, and we are fixing it.
This post-mortem describes what happened, why the second incident occurred so soon after the first, and the steps we are taking to prevent recurrence.
The trigger in both incidents was the same: a short packet-loss event in the affected data center. The expected behavior was that affected paths would be marked unhealthy briefly while the overlay rerouted traffic around them, and would return to "healthy" once the underlying network cleared.
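To make the expected behavior concrete, here is a minimal sketch of probe-driven path health. It is an illustrative model only, not the overlay's actual implementation: the `PathHealth` class, the `choose_path` helper, the window size, and the loss threshold are all assumptions chosen for the example.

```python
from collections import deque


class PathHealth:
    """Tracks one path from its most recent probe results (hypothetical model)."""

    def __init__(self, window=5, loss_threshold=0.4):
        # Only the last `window` probe results are kept; True means the probe was answered.
        self.results = deque(maxlen=window)
        self.loss_threshold = loss_threshold

    def record(self, probe_ok):
        self.results.append(probe_ok)

    @property
    def healthy(self):
        if not self.results:
            return True  # no evidence of loss yet
        loss = self.results.count(False) / len(self.results)
        return loss < self.loss_threshold


def choose_path(paths):
    """Return the name of the first healthy path in preference order, or None."""
    for name, health in paths.items():
        if health.healthy:
            return name
    return None


if __name__ == "__main__":
    paths = {"direct": PathHealth(), "via-relay": PathHealth()}
    for health in paths.values():
        for _ in range(5):
            health.record(True)
    print(choose_path(paths))      # direct path preferred while healthy

    for _ in range(3):             # brief packet loss on the direct path
        paths["direct"].record(False)
    print(choose_path(paths))      # traffic moves to the alternate path

    for _ in range(5):             # underlying network clears
        paths["direct"].record(True)
    print(choose_path(paths))      # direct path is used again
```

Because health is derived only from a bounded window of recent probes, a path in this model recovers on its own once probes succeed again. That self-clearing property is what did not hold during these incidents.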
Two separate defects interacted in a way that kept routing state stuck long after the network had recovered:
Neither defect on its own would have produced an outage. Together, they produced a stuck state in both directions: neighboring regions believed they could not reach the affected region (because probe validation kept failing), and the affected region believed it could not reach the others (because the freshness collapse kept its outbound state pinned to "unreachable"). Traffic stayed routed away from the affected region until the state was manually cleared.
Following Incident 1 on June 18, we developed and merged a fix for the overlay freshness defect on the same day. The fix shipped as a new version of the overlay software and was still being validated in staging when Incident 2 occurred. On June 22, the same compound failure occurred on US Central, which was still running the unpatched version of the overlay. At the time of writing, the fix is in active rollout to production.
Immediate (in active rollout)
Underway
Until our agent dispatch routing fix is deployed, the most reliable mitigation if you observe a dispatch outage is to restart your agent worker processes. This causes them to re-register against controllers in healthy regions.
To catch the condition early and automate the recovery, consider adding a health check that:
We recognize that this workaround places a burden on customers and is not a substitute for the underlying server-side fix. Eliminating it is a priority.
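For customers who want to automate the restart in the meantime, the following is a minimal sketch of a supervisor that restarts a worker process when it appears to have stopped receiving jobs. Everything here is an assumption for illustration: the `DispatchWatchdog` class, the idle threshold, and the idea that your worker can report each received job to the supervisor are hypothetical and not part of any LiveKit API. A long idle period can also be legitimate when there is simply no traffic, so the threshold should reflect your normal job rate.

```python
import subprocess
import time


class DispatchWatchdog:
    """Restarts a worker process if no job has arrived for too long (hypothetical)."""

    def __init__(self, command, max_idle_seconds=300.0,
                 clock=time.monotonic, spawn=subprocess.Popen):
        self.command = command                # argv list used to launch the worker
        self.max_idle_seconds = max_idle_seconds
        self.clock = clock                    # injectable for testing
        self.spawn = spawn                    # injectable for testing
        self.process = None
        self.last_job_at = clock()
        self.restarts = 0

    def start(self):
        self.process = self.spawn(self.command)
        # Give a freshly started worker a full idle window before judging it.
        self.last_job_at = self.clock()

    def job_received(self):
        """Call this whenever the worker reports that it received a job."""
        self.last_job_at = self.clock()

    def stalled(self):
        return self.clock() - self.last_job_at > self.max_idle_seconds

    def restart(self):
        # Stop the old process if it is still running, then launch a new one.
        if self.process is not None and self.process.poll() is None:
            self.process.terminate()
            try:
                self.process.wait(timeout=30)
            except subprocess.TimeoutExpired:
                self.process.kill()
                self.process.wait()
        self.start()
        self.restarts += 1

    def check(self):
        """Restart the worker if it looks stalled. Returns True if a restart happened."""
        if not self.stalled():
            return False
        self.restart()
        return True
```

In use, a supervisor would call `start()` once, call `job_received()` from whatever signal your worker exposes (a log line, a metric, or a callback), and call `check()` on a timer. How that signal is obtained depends on your deployment and is deliberately left out of the sketch.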
We sincerely apologize for the disruption to customers whose traffic and agent workloads were affected. Thank you for your patience, and we welcome any additional feedback from customers who were impacted.