Degraded Connectivity in US Central

Incident Report for LiveKit

Postmortem

Summary
On June 2, 2026, a small subset of users connected to the LiveKit Cloud US Central region experienced elevated connection failures between 12:06 and 14:38 UTC. A single node in the region entered a degraded state in which it continued to accept new connections but could not reliably complete the real-time media connection those calls depend on. As a result, a portion of participant connections, including inbound and outbound SIP calls routed through Chicago, failed to connect or dropped shortly after starting. We mitigated this by routing SIP traffic away from the affected region, suspending and removing the faulty node, and then restoring normal service to the region.

Timeline (UTC)

  • 12:06 — Start of measurable customer impact
  • 14:04 — Chicago SIP traffic drained as mitigation
  • 14:38 — Problematic server suspended; customer impact ends
  • 15:07 — Chicago SIP service fully restored

Root cause
The incident was caused by a "gray failure" of a single node in the Chicago region. A gray failure is a partial failure in which a server appears healthy to automated systems but is not actually functioning correctly. The server reported itself as available and continued to be assigned new calls, but it could not reliably bring the underlying real-time media connections to a fully active state.
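To illustrate the distinction described above, the following is a minimal sketch of a health classification that looks at connection outcomes rather than only a node's self-reported liveness. It is not our production health check; the `NodeStats` fields, the `classify` function, and the thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Hypothetical per-node counters over a recent time window."""
    liveness_ok: bool   # node answers its health probe
    attempts: int       # connections assigned to the node
    established: int    # connections whose media path became fully active


def classify(stats: NodeStats, min_attempts: int = 20,
             min_success_ratio: float = 0.9) -> str:
    """Classify a node using outcomes, not just self-reported liveness.

    The thresholds are illustrative assumptions, not tuned values.
    """
    if not stats.liveness_ok:
        return "down"
    if stats.attempts < min_attempts:
        # Too little data to judge outcomes; fall back to the probe result.
        return "healthy"
    ratio = stats.established / stats.attempts
    return "healthy" if ratio >= min_success_ratio else "gray"
```

Under these assumptions, a node that passes its probe but establishes only 35 of 100 assigned connections would be classified as "gray" instead of "healthy", which is the case a liveness-only check misses.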

Resolution
We first drained SIP traffic from the Chicago region, rerouting it to other regions. Once the failing node was identified, we suspended and removed it. After confirming that error rates had returned to normal, we restored full SIP service to the region.

Corrective actions and prevention
We are actively addressing the two gaps that allowed a single failing node to impact the region:

  • Error aversion in load balancing: configuring our load balancing servers to automatically detect and steer traffic away from nodes exhibiting this class of failure.
  • Monitoring: closing the alerting gap so that "gray failure" servers, which appear healthy but cannot complete connections, are detected and alerted on promptly.
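
As a rough illustration of the load balancing item above, the sketch below shows one common approach, passive outlier ejection: the balancer records the outcome of each connection it assigns and stops choosing nodes whose recent failure ratio is too high. This is a simplified sketch under stated assumptions, not a description of our actual configuration; the class name, parameters, and thresholds are hypothetical.

```python
import random
from collections import deque


class OutlierEjectingBalancer:
    """Hypothetical balancer that skips nodes with a high recent failure ratio."""

    def __init__(self, nodes, window=50, max_failure_ratio=0.5, min_samples=10):
        # Keep only the most recent `window` outcomes per node.
        self.results = {node: deque(maxlen=window) for node in nodes}
        self.max_failure_ratio = max_failure_ratio
        self.min_samples = min_samples

    def record(self, node, success):
        """Record whether a connection assigned to `node` became active."""
        self.results[node].append(bool(success))

    def is_ejected(self, node):
        outcomes = self.results[node]
        if len(outcomes) < self.min_samples:
            return False  # not enough data to judge this node
        failures = len(outcomes) - sum(outcomes)
        return failures / len(outcomes) > self.max_failure_ratio

    def pick(self, rng=random):
        """Choose a node at random among those not currently ejected."""
        candidates = [n for n in self.results if not self.is_ejected(n)]
        # Fail open: if every node looks bad, keep serving from all of them.
        return rng.choice(candidates or list(self.results))
```

Because an ejected node receives no new traffic in this sketch, it also produces no new samples. A practical version would need a cool-down or probe step to readmit nodes that have recovered.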
Posted Jun 04, 2026 - 07:03 PDT

Resolved

This incident has been resolved.
Posted Jun 02, 2026 - 08:18 PDT

Update

We've identified the root cause as a single problematic media node in US Central, which was suspended at 14:38 UTC. We are continuing to monitor before marking this resolved.
Posted Jun 02, 2026 - 07:58 PDT

Monitoring

We routed traffic away from the US Central region at 14:00 UTC and are seeing connection failures return to normal levels. We are continuing to monitor the issue.
Posted Jun 02, 2026 - 07:17 PDT

Investigating

We're investigating elevated SIP call connection failures in our US Central region beginning at ~12:06 UTC. We are working to mitigate this issue.
Posted Jun 02, 2026 - 07:05 PDT
This incident affected: Regional SIP (US Central - SIP).