Connection failures and SIP transfer errors in US Central

Incident Report for LiveKit

Postmortem

Summary

LiveKit Cloud is a globally distributed system. Our data centers continuously probe one another, and our overlay network uses those probes to choose healthy paths between regions. When a brief network disruption occurs in or between data centers, the overlay is designed to mark affected paths as unhealthy, route traffic around them, and return them to "healthy" once the underlying network has cleared.

On June 18, 2026 (US East 1) and again on June 22, 2026 (US Central), a short packet-loss event in the affected data center triggered a compound failure in this mechanism. Two separate defects, one in the overlay network and one in our link-monitoring service, interacted in a way that left the affected region effectively isolated from other regions long after the underlying network had cleared.

For most customers and most products, the impact was bounded. Our global routing automatically moved traffic to other US regions within approximately one minute of onset. SIP transfers anchored in the affected region took longer to recover but cleared within approximately thirty minutes.

A small number of LiveKit Agents running self-hosted workers experienced an extended outage in agent dispatch, up to about two hours in some cases. The majority of agent workloads continued to receive jobs normally after the routing flip. The workaround at the time was to restart the affected agent worker process so that it re-registered against a controller in a healthy region. This is a separate defect in our agent dispatch routing, and we are fixing it.

This post-mortem describes what happened, why the second incident occurred so soon after the first, and the steps we are taking to prevent recurrence.

Impact

  • Realtime connections: minimal impact. Clients automatically reconnected to alternative data centers.
  • API: approximately 1 minute of degraded availability while traffic rerouted to a nearby region.
  • SIP transfers: approximately 30 minutes for in-progress transfers anchored in the affected region.
  • Agent dispatch: approximately 1 minute for the majority of agents. A small number of agent workers saw extended dispatch outages of up to approximately 2 hours until the worker process was restarted.
  • Egress: A small number of in-progress egresses running in the affected regions were ended prematurely during the approximately 1-minute outage window, between 16:47 and 16:48 UTC on June 18, and between 12:24 and 12:25 UTC on June 22. These egresses were incorrectly marked as successful, and no error was surfaced to indicate that the recording had failed.

Root Cause

The trigger in both incidents was the same: a short packet-loss event in the affected data center. The expected behavior was that affected paths would be marked unhealthy briefly while the overlay rerouted traffic around them, and would return to "healthy" once the underlying network cleared.

Two separate defects interacted in a way that kept routing state stuck long after the network had recovered:

  1. Inbound and outbound freshness collapsed onto a single indicator. In the overlay network, the two directions of a connection (inbound and outbound) shared a single "freshness" indicator. When fresh data arrived in one direction, our code incorrectly assumed data in the other direction was also fresh. As a result, the affected region continued to treat its outbound connections to other regions as unreachable based on stale inbound data, even after outbound traffic was no longer impacted.
  2. Link-monitor key exchange could de-sync during a disruption. Our link-monitoring service relies on a key exchange between regions to validate probe traffic. The packet-loss event caused this key exchange to fall temporarily out of sync between the affected region and its neighbors. With probes failing validation, the link monitors in neighboring regions marked the affected region as "to avoid," and held that state past the actual network recovery.
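
To illustrate the first defect, here is a minimal sketch of tracking freshness separately for each direction of a link, so that data arriving in one direction cannot refresh or override the state of the other. The class name, field names, and the staleness threshold are hypothetical and chosen for illustration only; this is not the overlay's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkFreshness:
    """Per-direction freshness for one link (hypothetical illustration)."""

    stale_after: float                      # seconds of silence before a direction is stale
    last_inbound: Optional[float] = None    # timestamp of last inbound evidence
    last_outbound: Optional[float] = None   # timestamp of last outbound evidence

    def record_inbound(self, now: float) -> None:
        # Only the inbound timestamp moves; outbound state is untouched.
        self.last_inbound = now

    def record_outbound(self, now: float) -> None:
        # Only the outbound timestamp moves; inbound state is untouched.
        self.last_outbound = now

    def _fresh(self, last: Optional[float], now: float) -> bool:
        # A direction with no evidence at all is never considered fresh.
        return last is not None and (now - last) <= self.stale_after

    def inbound_fresh(self, now: float) -> bool:
        return self._fresh(self.last_inbound, now)

    def outbound_fresh(self, now: float) -> bool:
        return self._fresh(self.last_outbound, now)
```

With separate timestamps, a link whose outbound direction has recovered reports outbound as fresh even while inbound data is still stale, which is the distinction the shared indicator could not express.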

Neither defect on its own would have produced an outage. Together, they produced a stuck state in both directions: neighboring regions believed they could not reach the affected region (because probe validation kept failing), and the affected region believed it could not reach the others (because the freshness collapse kept its outbound state pinned to "unreachable"). Traffic stayed routed away from the affected region until the state was manually cleared.

Following Incident 1 on June 18, we developed and merged a fix for the overlay freshness defect on the same day. The fix shipped as a new version of the overlay software and was being validated in staging at the time of Incident 2. On June 22, the same compound failure occurred in US Central, which was still running the unpatched version of the overlay. At the time of writing, the fix is in active rollout to production.

Corrective Actions & Prevention

Immediate (in active rollout)

  • Roll the overlay freshness fix to production region-by-region. As of this writing, the rollout is underway and we expect global production coverage in the next two days. We are prioritizing this rollout given the severity.

Underway

  • Improve resilience of the link-monitor probing process to key-exchange de-sync. This addresses the second of the two defects described above. Eliminating it is necessary to prevent the compound failure even after the overlay freshness fix is in place. ETA: Thursday, June 25.
  • Migrate existing agent worker connections on health changes. We correctly detected the data center as unhealthy and rerouted new traffic away quickly, but did not migrate existing agent worker connections. The change is to automatically migrate agent worker connections to healthy data centers as soon as the health indicator starts failing. ETA: June 30.
  • Make egress resilient to region disruptions. In-progress Egress should use the same mechanism available in our realtime SDK to reconnect to alternative regions. ETA: June 30.
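
For the link-monitor item in the list above, one common general technique for tolerating a brief key de-sync is to accept probes authenticated under either the current key or the immediately preceding one. The sketch below illustrates that technique only; it is an assumption for illustration and not a description of the fix being shipped. The `ProbeValidator` class, the `sign_probe` helper, and the choice of HMAC-SHA256 are all hypothetical.

```python
import hashlib
import hmac
from typing import Optional


def sign_probe(key: bytes, payload: bytes) -> bytes:
    """Return an HMAC-SHA256 tag for a probe payload (hypothetical scheme)."""
    return hmac.new(key, payload, hashlib.sha256).digest()


class ProbeValidator:
    """Validates probes against the current key and one previous key."""

    def __init__(self, current_key: bytes, previous_key: Optional[bytes] = None):
        self._current = current_key
        self._previous = previous_key

    def rotate(self, new_key: bytes) -> None:
        # Keep the outgoing key so a peer that has not yet picked up the
        # new key is not immediately treated as failing validation.
        self._previous = self._current
        self._current = new_key

    def is_valid(self, payload: bytes, tag: bytes) -> bool:
        for key in (self._current, self._previous):
            if key is None:
                continue
            # compare_digest performs a constant-time comparison.
            if hmac.compare_digest(sign_probe(key, payload), tag):
                return True
        return False
```

The trade-off in this design is a slightly longer validity window for an old key in exchange for not marking a peer unhealthy solely because the two sides are one key apart.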

For customers running self-hosted agent workers

Until our agent dispatch routing fix is deployed, the most reliable mitigation if you observe a dispatch outage is to restart your agent worker processes. This causes them to re-register against controllers in healthy regions.

To catch the condition early and automate the recovery, consider adding a health check that:

  • Tracks the time since the worker last received a dispatch (or last completed a session).
  • Triggers a restart of the worker process if this exceeds an expected idle window for your workload.
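
The two steps above can be sketched as a small idle watchdog. This is a minimal sketch under stated assumptions: the `DispatchWatchdog` class and its method names are hypothetical, the idle threshold is yours to choose, and how you hook `record_activity` into your worker depends on the SDK and version you run.

```python
import time
from typing import Callable


class DispatchWatchdog:
    """Tracks time since the last dispatch and flags when a restart is due."""

    def __init__(
        self,
        max_idle_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_idle = max_idle_seconds
        self._clock = clock
        # Start the idle timer at construction so a worker that never
        # receives a dispatch is also caught.
        self._last_activity = clock()

    def record_activity(self) -> None:
        """Call whenever the worker receives a dispatch or completes a session."""
        self._last_activity = self._clock()

    def idle_seconds(self) -> float:
        return self._clock() - self._last_activity

    def should_restart(self) -> bool:
        """True once the worker has been idle longer than the allowed window."""
        return self.idle_seconds() > self._max_idle
```

A periodic check can call `should_restart()` and, when it returns true, exit the process with a non-zero status so that your process supervisor starts a fresh worker. Set the idle window comfortably above the longest quiet period you expect in normal operation, otherwise low-traffic periods will trigger unnecessary restarts.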

We recognize that this workaround places a burden on customers and is not a substitute for the underlying server-side fix. Eliminating it is a priority.

We sincerely apologize for the disruption to customers whose traffic and agent workloads were affected. Thank you for your patience, and we welcome any additional feedback from customers who were impacted.

Posted Jun 23, 2026 - 01:33 PDT

Resolved

We are resolving this incident as the mitigation is in place and successful. We have received some reports of self-hosted agents which connect to US Central needing to be restarted in order to continue receiving dispatches and will investigate opportunities to improve this behavior.

Apologies for the disruptions and we will follow up with a postmortem as soon as possible.
Posted Jun 22, 2026 - 07:40 PDT

Update

As soon as the US Central region became unresponsive around 12:25 UTC, traffic was automatically re-routed to the nearest healthy region. Customers may have noticed failed API requests for approximately 1 minute while the re-route took place. Failed SIP transfers continued until we finished manually draining the impacted region at 13:09 UTC.

We are also investigating impact to Agent Dispatches and will post another update when we know more.
Posted Jun 22, 2026 - 06:59 PDT

Monitoring

We have begun routing traffic away from US Central after our monitoring system triggered due to a high API error rate. Customer impact may include WebRTC connection failures and SIP transfer errors.
Posted Jun 22, 2026 - 06:05 PDT
This incident affected: Regional SIP (US Central - SIP) and Regional Real Time Communication (US Central - Real Time Communication).