Summary
On June 2, 2026, a small subset of users connected to the LiveKit Cloud US Central region experienced elevated connection failures between 12:06 and 14:38 UTC. A single node in the region entered a degraded state in which it continued to accept new connections but could not reliably complete the real-time media connection those calls depend on. As a result, a portion of participant connections, including inbound and outbound SIP calls routed through Chicago, failed to connect or dropped shortly after starting. We mitigated the issue by routing SIP traffic away from the affected server and suspending and removing the faulty node, then restored normal service to the region.
Timeline (UTC)
Root cause
The incident was caused by a "gray failure" of a single node in the Chicago region. A gray failure is a partial failure in which a server appears healthy to automated systems but is not actually functioning correctly. In this case, the server reported itself as available and continued to be assigned new calls, but could not reliably bring the underlying real-time media connections to a fully active state.
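To illustrate the general idea of a gray failure, the sketch below compares what a node reports about itself with what actually happens to the connections assigned to it. It is a minimal, hypothetical example and does not describe LiveKit's monitoring: the `NodeStats` fields, the node names, the sample counts, and both thresholds are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Hypothetical per-node counters over a recent time window."""
    name: str
    reports_healthy: bool   # what the node says about itself
    sessions_assigned: int  # connections routed to the node
    sessions_active: int    # connections whose media became fully active


def is_gray_failure(stats: NodeStats,
                    min_sessions: int = 20,
                    min_success_ratio: float = 0.9) -> bool:
    """Flag a node that claims to be healthy but fails to complete connections."""
    if not stats.reports_healthy:
        return False  # an ordinary, visible failure rather than a gray one
    if stats.sessions_assigned < min_sessions:
        return False  # too little traffic in the window to judge
    ratio = stats.sessions_active / stats.sessions_assigned
    return ratio < min_success_ratio


# Illustrative data only; these are not figures from the incident.
nodes = [
    NodeStats("node-a", True, 200, 197),
    NodeStats("node-b", True, 180, 41),
    NodeStats("node-c", False, 0, 0),
]
suspects = [n.name for n in nodes if is_gray_failure(n)]
print(suspects)  # ['node-b']
```

The point of the sketch is that the signal comes from observed outcomes (connections that became active) rather than from the node's own status report, which is the report a gray failure leaves unchanged.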
Resolution
We first drained SIP traffic from the Chicago region, rerouting it to other regions. Once the failing node was identified, we suspended and removed it. After confirming that error rates had returned to normal, we restored full SIP service to the region.
Corrective actions and prevention
We are actively addressing the two gaps that allowed a single failing node to impact the region: