This incident was reported directly by customers rather than detected by our internal monitoring systems. Because our monitoring did not identify the failure, it never triggered automated alerting or incident response, which points to a gap in our observability. We sincerely apologize to affected customers; we take this detection failure seriously and are committed to doing better.
The root cause was degradation in two of our Network Load Balancers (NLBs) in US East, which caused a portion of incoming HTTPS connections to hang before reaching our backend servers.
Most of our client SDKs and applications (Web, mobile, Go SDK, etc.) have built-in timeouts and retries to survive failure modes like this. For these clients, a hanging initial connection would typically time out quickly and succeed on a subsequent retry, effectively masking the underlying problem for the majority of users.
The Rust SDK (and other SDKs built on the Rust core) was affected much more severely. While it implements retries, it did not enforce a timeout on the initial connection attempt, so affected connections could hang for much longer, causing noticeable stalls and a degraded experience for Rust-based clients.
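For readers unfamiliar with the pattern, the sketch below shows roughly what a connect timeout combined with bounded retries looks like in Rust, using a reqwest-based blocking client. The endpoint, timeout values, and retry count are illustrative and are not taken from our SDKs.

```rust
use std::{thread, time::Duration};

// Requires the reqwest crate with the "blocking" feature enabled.
use reqwest::blocking::Client;

/// Hypothetical endpoint; not an actual service URL.
const ENDPOINT: &str = "https://api.example.com/health";

fn fetch_with_retries() -> Result<String, reqwest::Error> {
    // The connect timeout bounds how long the initial TCP/TLS handshake may
    // hang; the overall timeout bounds the whole request. Without a connect
    // timeout, a hanging connection can stall for a very long time.
    let client = Client::builder()
        .connect_timeout(Duration::from_secs(3))
        .timeout(Duration::from_secs(10))
        .build()?;

    let max_attempts = 3;
    let mut last_err = None;

    for attempt in 1..=max_attempts {
        match client.get(ENDPOINT).send() {
            Ok(resp) => return resp.text(),
            Err(err) => {
                eprintln!("attempt {attempt} failed: {err}");
                last_err = Some(err);
                // Simple backoff between attempts; real SDKs typically use
                // exponential backoff with jitter.
                thread::sleep(Duration::from_millis(200 * attempt as u64));
            }
        }
    }

    Err(last_err.expect("at least one attempt was made"))
}

fn main() {
    match fetch_with_retries() {
        Ok(body) => println!("ok: {} bytes", body.len()),
        Err(err) => eprintln!("all attempts failed: {err}"),
    }
}
```

With a connect timeout in place, a hanging connection fails fast and the retry has a good chance of landing on a healthy load balancer, which is exactly the behavior that masked this incident for most clients.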
The primary reason this incident evaded detection was that our end-to-end monitoring relied on probes built with the JavaScript and Go SDKs, both of which gracefully handled the hanging connections via timeouts and retries. This created a blind spot for this specific failure mode.
(all times in UTC, 2026-02-14)
17:00 - Received a customer report that a portion of connections were failing
17:10 - Began investigating the report
17:20 - Confirmed that our connection tests were passing and that error/warning rates did not look elevated
17:25 - Team concluded (prematurely) that no widespread issue existed
20:45 - Direct testing by IP confirmed that two NLBs in US East were hanging on ~8% of requests (see the sketch after this timeline)
20:50 - On-call engineer paged to troubleshoot the load balancers
22:00 - Traffic fully diverted from the degraded load balancers; service recovered
00:00 (2026-02-15) - Created new load balancers to replace the faulty ones
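The per-IP test at 20:45 can be reproduced with something like the sketch below, which pins the service hostname to each load balancer address so that every request hits a specific NLB, and uses a connect timeout so a hanging connection is converted into a countable error. The hostname, addresses, and sample size are hypothetical, and this assumes a reqwest-style client rather than any of our SDKs.

```rust
use std::{net::SocketAddr, time::Duration};

// Requires the reqwest crate with the "blocking" feature enabled.
use reqwest::blocking::Client;

// Hypothetical values; substitute the real hostname and NLB addresses.
const HOSTNAME: &str = "api.example.com";
const NLB_ADDRS: &[&str] = &["192.0.2.10:443", "192.0.2.11:443"];
const ATTEMPTS: u32 = 100;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    for addr in NLB_ADDRS {
        let socket: SocketAddr = addr.parse()?;

        // `resolve` overrides DNS for HOSTNAME, so every request in this
        // loop goes to exactly one NLB. A short connect timeout turns a
        // hanging connection into an error instead of a stall.
        let client = Client::builder()
            .resolve(HOSTNAME, socket)
            .connect_timeout(Duration::from_secs(2))
            .timeout(Duration::from_secs(5))
            .build()?;

        let mut failures = 0;
        for _ in 0..ATTEMPTS {
            if client
                .get(format!("https://{HOSTNAME}/health"))
                .send()
                .is_err()
            {
                failures += 1;
            }
        }

        println!(
            "{addr}: {failures}/{ATTEMPTS} requests failed ({:.1}%)",
            100.0 * failures as f64 / ATTEMPTS as f64
        );
    }
    Ok(())
}
```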
The following improvements have been implemented or initiated to reduce the likelihood and impact of similar incidents:
We will continue to expand failure-mode-specific monitoring (beyond SDK-based probes) and periodically validate that our alerting covers realistic client behaviors across all major SDKs.
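To make that concrete, here is a minimal sketch of what a failure-mode-specific probe might look like: a single-shot TCP connect check with a strict timeout and deliberately no retries, run against every address behind the endpoint so a partially degraded load balancer surfaces as a failure instead of being masked. The endpoint and timeout are illustrative; a production probe would likely also complete a TLS handshake or a full request and feed its results into alerting.

```rust
use std::{
    net::{TcpStream, ToSocketAddrs},
    process::ExitCode,
    time::Duration,
};

// Hypothetical endpoint; a monitoring scheduler would run this probe
// periodically and alert on the failure rate it observes over time.
const ENDPOINT: &str = "api.example.com:443";
const CONNECT_TIMEOUT: Duration = Duration::from_secs(2);

fn main() -> ExitCode {
    let addrs: Vec<_> = match ENDPOINT.to_socket_addrs() {
        Ok(addrs) => addrs.collect(),
        Err(err) => {
            eprintln!("probe: DNS resolution failed: {err}");
            return ExitCode::FAILURE;
        }
    };

    if addrs.is_empty() {
        eprintln!("probe: DNS returned no addresses");
        return ExitCode::FAILURE;
    }

    // Probe every resolved address so a single degraded load balancer
    // cannot hide behind healthy ones. Deliberately no retry: a hanging
    // connect must surface as a failure rather than be masked.
    let mut failed = false;
    for addr in &addrs {
        match TcpStream::connect_timeout(addr, CONNECT_TIMEOUT) {
            Ok(_) => println!("probe: {addr} ok"),
            Err(err) => {
                eprintln!("probe: {addr} failed: {err}");
                failed = true;
            }
        }
    }

    if failed {
        ExitCode::FAILURE
    } else {
        ExitCode::SUCCESS
    }
}
```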
Thank you for your patience and understanding. We appreciate any additional feedback from customers who were affected.