Increased connection errors
Incident Report for LiveKit
Postmortem

May 6th 17:17 UTC - Deployment of relay protocol v2:
This update enabled relaying directly without needing the overlay network, a first step towards automatically bypassing regional connectivity issues. The change had undergone thorough validation in staging and preprod for three weeks, including load testing with 20K concurrent participants in a session.
We deployed this update during a period of historically low traffic across our clusters.

May 7th 15:48 UTC - Media Node Shutdowns and Investigation:
We noticed media nodes shutting down due to a lock issue, which forced some users to migrate to new instances. Our team began investigating the alarms.

May 7th 15:54 UTC - Partial Outage and Increased Error Rates:
Media nodes were failing faster than they could recover in certain regions, causing higher error rates for the global load balancer and a partial outage for some users.

May 7th 15:59 UTC - Reverting to Previous Build and Service Restoration:
Suspecting the issue was related to the previous day's deployment, we reverted to the earlier build. Error rates began to drop.

May 7th 16:16 UTC - Ongoing Periodic Node Failures:
Although service had been restored, we continued to observe occasional node failures, causing minor service impact; affected users were migrated to healthy instances whenever a failure occurred.

May 7th 18:00 UTC - Resolution of Periodic Node Failures:
The node failures were traced back to a few older instances still running the relay v2 code. They had been put into draining mode during the rollback so that users connected to them were not disrupted, which left them serving traffic on the old code and failing periodically. After gracefully terminating these instances, the periodic node failures were resolved.

In the hours following the incident, the team identified the root cause as the relay v2 change. The new relay protocol changed connection establishment from asynchronous to synchronous, which, combined with the lock, prevented the media server from accepting new requests while relay connections were being initialized. The issue did not manifest in staging and preprod for a couple of reasons:

  • The regions in staging and preprod are closer together, resulting in lower cross-region latency.
  • Production has more relay traffic per server, causing the lock to be held longer.

We believe these factors caused the problem to surface on Sunday, when shifting traffic patterns pushed the lock hold time high enough to prevent new connections from being established.
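To make the failure mode concrete, below is a minimal, hypothetical Go sketch (not LiveKit's actual code) of a server where a single mutex guards both request admission and relay setup. When relay setup becomes a synchronous cross-region dial performed under the lock, new requests block for the entire dial; the asynchronous variant only holds the lock for bookkeeping.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type mediaServer struct {
	mu sync.Mutex
}

// dialRelay stands in for establishing a relay connection to a remote region;
// the sleep models cross-region round-trip latency.
func (s *mediaServer) dialRelay(rtt time.Duration) {
	time.Sleep(rtt)
}

// addRelaySync models the v2 behavior: the relay connection is established
// while the lock is held, so request admission cannot proceed until the dial
// completes.
func (s *mediaServer) addRelaySync(rtt time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.dialRelay(rtt)
}

// addRelayAsync models the v1 behavior: the lock is held only for bookkeeping
// and the dial runs outside the critical section.
func (s *mediaServer) addRelayAsync(rtt time.Duration) {
	s.mu.Lock()
	// ...record that relay setup is in progress...
	s.mu.Unlock()
	go s.dialRelay(rtt)
}

// acceptRequest models admitting a new participant, which needs the same lock.
// It returns how long the caller waited to acquire it.
func (s *mediaServer) acceptRequest() time.Duration {
	start := time.Now()
	s.mu.Lock()
	defer s.mu.Unlock()
	return time.Since(start)
}

func main() {
	const crossRegionRTT = 300 * time.Millisecond

	v2 := &mediaServer{}
	go v2.addRelaySync(crossRegionRTT)
	time.Sleep(10 * time.Millisecond) // let relay setup grab the lock first
	fmt.Println("synchronous setup, request blocked for:", v2.acceptRequest())

	v1 := &mediaServer{}
	go v1.addRelayAsync(crossRegionRTT)
	time.Sleep(10 * time.Millisecond)
	fmt.Println("asynchronous setup, request blocked for:", v1.acceptRequest())
}
```

In this toy model, the dial time scales with cross-region latency and relay volume, which is consistent with why the same code held the lock far longer in production than in staging or preprod.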

To prevent similar incidents in the future, we will be adding steps to our validation process. These will include network latency simulation as well as production traffic simulation prior to deploying changes to the core routing stack.
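As one illustration of what latency simulation could look like, the sketch below wraps a dialed connection so that every read is delayed by a fixed amount, letting a test environment exercise the longer lock hold times that cross-region links produce in production. The names and delay values are illustrative assumptions, not an existing LiveKit utility.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// latencyConn wraps a net.Conn and delays every Read by a fixed amount,
// approximating a higher round-trip time on responses.
type latencyConn struct {
	net.Conn
	delay time.Duration
}

func (c *latencyConn) Read(p []byte) (int, error) {
	time.Sleep(c.delay)
	return c.Conn.Read(p)
}

// dialWithLatency dials normally, then wraps the connection with the delay.
func dialWithLatency(ctx context.Context, addr string, delay time.Duration) (net.Conn, error) {
	var d net.Dialer
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return nil, err
	}
	return &latencyConn{Conn: conn, delay: delay}, nil
}

func main() {
	// Local echo server standing in for a remote relay peer.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		buf := make([]byte, 4)
		if _, err := c.Read(buf); err == nil {
			c.Write(buf)
		}
	}()

	conn, err := dialWithLatency(context.Background(), ln.Addr().String(), 150*time.Millisecond)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	start := time.Now()
	conn.Write([]byte("ping"))
	buf := make([]byte, 4)
	conn.Read(buf)
	fmt.Printf("simulated round trip: %v\n", time.Since(start))
}
```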

Posted May 08, 2023 - 14:09 PDT

Resolved
Increased connection errors to the global load balancer for about 5 minutes. We've identified the issue as related to a deployment made on Saturday. The deployment was rolled back and the disruptions have subsided.

Post mortem to come.
Posted May 07, 2023 - 09:00 PDT