RPC issue
Incident Report for LiveKit
Postmortem

The Incident

We had an outage caused by a misconfigured internal RPC, which prevented servers from receiving media from one another within the mesh network. We apologize for any inconvenience caused by this outage. While LiveKit Cloud is architected to withstand hardware and network link failures, this incident resulted from an operator error during deployment.

On November 2nd at 03:06 UTC, we deployed a hotfix intended to fix a bug causing intermittent server reboots. Each reboot briefly disrupted the sessions connected to that server.

By 03:15 UTC, the change was fully deployed. Shortly afterward, we observed that a significant share of incoming sessions were failing to acquire tracks from other servers. At that point, we attributed the failures to the latest build and began rolling back to the previous deployment.

The majority of data centers were fully operational by 03:28 UTC. US West continued to experience failures until 03:41 UTC.

Root cause

The hotfix inadvertently included a change to our internal RPC protocol. This change was meant to ship with an accompanying update to the configuration file. Without that configuration, servers were left stuck waiting to acquire tracks from media servers that did not have the track.

As a result, our servers could not acquire media tracks from other servers, disrupting a key function of the mesh network.
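
To illustrate the failure mode, the minimal Go sketch below shows how a track-acquisition call can hang when routing points at a peer that never has the requested track. The identifiers (rpcClient, AcquireTrack, TrackRequest) are hypothetical and not LiveKit's actual internals; the point is that without an explicit deadline, the wait is unbounded and surfaces only as failed sessions.

  // A minimal sketch, not LiveKit's actual implementation: a track-acquisition
  // RPC that hangs when routing points at a peer that never has the requested
  // track. Identifiers such as rpcClient, AcquireTrack, and TrackRequest are
  // hypothetical and exist only to illustrate the failure mode.
  package main

  import (
    "context"
    "errors"
    "fmt"
    "time"
  )

  // TrackRequest identifies the media track one server wants to pull from a peer.
  type TrackRequest struct {
    TrackID string
    Peer    string // peer expected to hold the track, resolved from configuration
  }

  type rpcClient struct{}

  // subscribe simulates waiting for the peer to announce the track. If the peer
  // was chosen from stale or missing configuration, the announcement never comes.
  func (c *rpcClient) subscribe(req TrackRequest) <-chan struct{} {
    return make(chan struct{}) // never signalled: this peer does not have the track
  }

  // AcquireTrack returns when the track becomes available or the context ends.
  func (c *rpcClient) AcquireTrack(ctx context.Context, req TrackRequest) error {
    ready := c.subscribe(req)
    select {
    case <-ready:
      return nil
    case <-ctx.Done():
      return ctx.Err() // without a deadline, there is no other way out
    }
  }

  func main() {
    client := &rpcClient{}

    // Bounding the wait turns a silent hang into an observable, alertable error.
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    err := client.AcquireTrack(ctx, TrackRequest{TrackID: "TR_example", Peer: "us-west"})
    if errors.Is(err, context.DeadlineExceeded) {
      fmt.Println("track acquisition timed out: peer never announced the track")
    }
  }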

The change had been validated in our canary environment without error. However, our current canary deployment runs in a single region, which did not allow us to surface the timeouts that would typically be seen across multiple data centers.

Remediation plan

To tighten our deployment process, we will:

Improve deployment process

  • Ensure hotfixes are minimal and directly address the issue
  • Mandate multiple sign-offs for deploys

Enhance canary process

  • Expand canary deployment to additional regions
  • Implement blue/green deployments for gradual traffic transition (see the sketch after this list)
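
As a sketch of what gradual traffic transition could look like (an illustration, not our actual deployment tooling), the Go snippet below routes a configurable fraction of new sessions to a "green" pool running the new build, ramping up only after each step stays healthy.

  // A minimal sketch, not our actual deployment tooling: weighted routing of
  // new sessions between a "blue" pool (current build) and a "green" pool
  // (new build). The ramp schedule and pool names are illustrative.
  package main

  import (
    "fmt"
    "math/rand"
  )

  // pickPool sends a session to the green pool with probability greenWeight
  // (0.0-1.0) and to the blue pool otherwise, so a bad build only ever sees a
  // bounded slice of traffic before health checks halt the ramp.
  func pickPool(greenWeight float64) string {
    if rand.Float64() < greenWeight {
      return "green"
    }
    return "blue"
  }

  func main() {
    // Gradual ramp: 5% -> 25% -> 50% -> 100% of new sessions on the new build.
    // In practice each step would be gated on health metrics (omitted here).
    for _, weight := range []float64{0.05, 0.25, 0.50, 1.00} {
      counts := map[string]int{}
      for i := 0; i < 1000; i++ {
        counts[pickPool(weight)]++
      }
      fmt.Printf("ramp %.0f%%: blue=%d green=%d\n", weight*100, counts["blue"], counts["green"])
    }
  }
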
Posted Nov 02, 2023 - 23:13 PDT

Resolved
Servers were unable to communicate with each other. At 03:15 UTC, a change was deployed to our servers that disrupted server-to-server communication. Our monitoring systems detected the issue and engineers were alerted. The offending change was rolled back by 03:28 UTC, resulting in a 13-minute outage. We are taking this incident very seriously and will share a comprehensive post-mortem detailing process improvements to prevent such occurrences in the future.
Posted Nov 01, 2023 - 20:00 PDT