Track Egress experienced intermittent failures and delayed status updates caused by RPC instability between the egress and controller services in US regions (primarily Phoenix and Chicago). Room Composite egresses saw a much smaller impact. Other egress types were not broadly affected.
Customer impact (initial estimates during window 2025-11-10 14:00–18:30 UTC):
The incident was caused by a bug in the egress RPC client that made RPCs fail under some conditions, affecting status updates and egress service availability.
Timestamps are in UTC
The LiveKit infrastructure relies on controller nodes to dispatch requests to egress nodes and to update the stored Egress status. An RPC mechanism transports messages between these two services. A bug in the egress RPC client caused it, on rare occasions, to enter a bad state in which it could no longer send new RPC messages. An egress instance in this state was unable to update the status of an egress request or to start servicing new requests. This in turn would cause the egress cluster to run out of capacity.
The failed egress instances were drained and replaced with new ones. These new instances were monitored to ensure they did not enter the failed state. The underlying issue in the RPC implementation was identified and corrected. We also added a watchdog to egress instances that automatically replaces them with new ones should such an issue occur again.
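
To illustrate the watchdog idea, here is a minimal sketch in Go. It is not the actual egress implementation. It assumes a simple policy in which an instance is flagged for replacement after a fixed number of consecutive failed RPC sends. The Watchdog type, its method names, and the threshold of 3 are all hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Watchdog is a hypothetical health tracker: it counts consecutive
// failed RPC sends and flags the instance once a threshold is reached.
type Watchdog struct {
	mu        sync.Mutex
	failures  int // consecutive failures since the last success
	threshold int // failures needed before replacement is requested
}

// NewWatchdog returns a watchdog that trips after threshold consecutive failures.
func NewWatchdog(threshold int) *Watchdog {
	return &Watchdog{threshold: threshold}
}

// RecordSuccess resets the failure streak after a successful RPC send.
func (w *Watchdog) RecordSuccess() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.failures = 0
}

// RecordFailure extends the failure streak after a failed RPC send.
func (w *Watchdog) RecordFailure() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.failures++
}

// ShouldReplace reports whether the instance should be drained and replaced.
func (w *Watchdog) ShouldReplace() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.failures >= w.threshold
}

func main() {
	w := NewWatchdog(3) // assumed threshold, for illustration only
	for i := 0; i < 3; i++ {
		w.RecordFailure()
	}
	fmt.Println("replace instance:", w.ShouldReplace())
}
```

In this sketch a single successful send clears the streak, so transient errors do not trigger a replacement; only an instance that is persistently unable to send RPCs would be flagged.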