US - Issues with Track Egress starts

Incident Report for LiveKit

Postmortem

Summary

Track Egress experienced intermittent failures and delayed status updates caused by RPC instability between the egress and controller services in US regions (primarily Phoenix and Chicago). Room Composite egresses saw a much smaller impact. Other egress types were not broadly affected.

Customer impact (initial estimates during window 2025-11-10 14:00–18:30 UTC):

  • ~2.25% of Track Egress start requests failed
  • ~0.0175% of Room Composite starts failed
  • ~0.75% of egresses had missing or delayed status updates

The incident was caused by a bug in the egress RPC client that made RPCs fail under some conditions, affecting status updates and egress service availability.

Timeline

Timestamps are in UTC

  • 14:00: Some Track Egress requests start failing. Some successful egresses never reach the COMPLETE status in the cloud dashboard.
  • 14:14: First alert for increased egress start latency. Investigation starts.
  • 18:00: Issue is identified as RPC failures preventing some egress instances from updating the egress state or servicing new egress requests.
  • 18:15: New egress instances are brought up to replace the failed ones, mitigating the outage. Customer impact ends.
  • Nov 12: Underlying bug in RPC client is identified, and a fix is deployed to the egress cluster.

Root Cause Analysis

The LiveKit infrastructure relies on controller nodes to dispatch requests to egress nodes and to update the stored egress status. An RPC mechanism transports messages between these two services. A bug in the egress RPC client caused it, in rare cases, to enter a bad state in which it could no longer send new RPC messages. An affected egress instance was then unable to update the status of an egress request or to start servicing new requests, which in turn caused the egress cluster to run out of capacity.

Mitigations

The failed egress instances were drained and replaced with new ones. These new instances were monitored to ensure they did not enter the failed state. The underlying issue in the RPC implementation was identified and corrected. We also added a watchdog to egress instances that automatically replaces them with new ones should such an issue occur again.
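
For readers who want a concrete picture of the watchdog idea, the following is a minimal Go sketch and not the actual LiveKit implementation. It assumes a hypothetical `Prober` function that sends a lightweight RPC and returns an error when the call fails, and it flags the instance for replacement after a configurable number of consecutive failed probes. The names `Watchdog`, `Prober`, `Observe`, and `Run`, as well as the threshold and intervals, are illustrative assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Prober is a hypothetical health check: it sends a lightweight RPC
// and returns an error if the call fails or times out.
type Prober func(ctx context.Context) error

// Watchdog counts consecutive probe failures (illustrative type).
type Watchdog struct {
	threshold int // consecutive failures before the instance is flagged
	failures  int // current run of consecutive failures
}

func NewWatchdog(threshold int) *Watchdog {
	return &Watchdog{threshold: threshold}
}

// Observe records one probe result and reports whether the instance
// should now be drained and replaced. A success resets the counter.
func (w *Watchdog) Observe(ok bool) bool {
	if ok {
		w.failures = 0
		return false
	}
	w.failures++
	return w.failures >= w.threshold
}

// Run probes at the given interval. It returns true once the failure
// threshold is reached, or false if ctx is cancelled first.
func (w *Watchdog) Run(ctx context.Context, interval, timeout time.Duration, probe Prober) bool {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return false
		case <-ticker.C:
			// Bound each probe so a stuck RPC client counts as a failure.
			probeCtx, cancel := context.WithTimeout(ctx, timeout)
			err := probe(probeCtx)
			cancel()
			if w.Observe(err == nil) {
				return true
			}
		}
	}
}

func main() {
	// Simulated probe: succeeds twice, then fails on every call,
	// mimicking an RPC client that has entered a bad state.
	calls := 0
	probe := func(ctx context.Context) error {
		calls++
		if calls > 2 {
			return errors.New("rpc client stuck")
		}
		return nil
	}

	w := NewWatchdog(3)
	if w.Run(context.Background(), 10*time.Millisecond, 5*time.Millisecond, probe) {
		fmt.Println("instance unhealthy after", calls, "probes: drain and replace")
	}
}
```

Requiring several consecutive failures rather than a single one is a design choice in this sketch: it avoids replacing an instance because of one transient RPC error, at the cost of a slightly slower reaction.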

Posted Nov 14, 2025 - 09:25 PST

Resolved

This incident was resolved as of 10:30 am PST and we have not observed any additional errors since then. We will update this incident with more details on the impact that we observed.
Posted Nov 10, 2025 - 10:30 PST

Investigating

We are currently investigating reports of track egresses failing to start in our US West and US Central regions. Other egress types are not impacted.
Posted Nov 10, 2025 - 08:29 PST
This incident affected: Regional Egress (US West - Egress, US East - Egress, US Central - Egress).