We had an outage caused by a misconfigured internal RPC, which prevented servers from receiving media from one another within the mesh network. We apologize for any inconvenience caused by this outage. While LiveKit Cloud is architected to withstand hardware and network link failures, this incident resulted from an operator error during deployment.
On November 2nd at 03:06 UTC, we deployed a hotfix intended to fix a bug causing intermittent server reboots. Each reboot would briefly disrupt the sessions connected to that server.
By 03:15 UTC, the change was fully deployed. Shortly afterward, we observed that a significant share of incoming sessions were failing to acquire tracks from other servers. We attributed the failures to the latest build and began rolling back to the previous deployment.
The majority of data centers were fully operational by 03:28 UTC. US West continued to experience failures until 03:41 UTC.
The hotfix inadvertently included a change to our internal RPC protocol. That change was supposed to ship with an accompanying update to the configuration file. Without that configuration, a server would be left stuck waiting to acquire tracks from media servers that did not have them.
As a result, our servers could not acquire media tracks from other servers, disrupting a key function of the mesh network.
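To illustrate the failure mode, here is a minimal sketch, not LiveKit's actual code, of how an inter-server track request that carries no deadline can hang indefinitely when the peer does not have the requested track. The function and parameter names are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// requestTrack simulates asking a peer media server for a track. When the
// peer lacks the track, no response ever arrives, mirroring the "stuck
// waiting" behavior from the incident; only the caller's deadline unblocks it.
func requestTrack(ctx context.Context, peerHasTrack bool) error {
	if peerHasTrack {
		return nil
	}
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	// Bounding the request with a deadline turns an indefinite hang into a
	// fast, observable failure that the caller can retry or report.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	if err := requestTrack(ctx, false); err != nil {
		fmt.Println("track acquisition failed:", err)
	}
}
```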
This change had been validated in our canary environment without error. However, our canary deployment currently runs in a single region, so it could not surface the cross-region timeouts that only appear when traffic spans multiple data centers.
To tighten our deployment process, we will:
- Improve our deployment process so that protocol changes ship together with their required configuration updates (a sketch of the kind of safeguard we have in mind follows this list)
- Enhance our canary process so that cross-region behavior is exercised before a full rollout
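As one example of the first item, below is a hedged sketch of a startup check that refuses to run a new RPC protocol version without its accompanying configuration. The struct fields and version constant are assumptions for illustration, not our actual configuration schema.

```go
package main

import (
	"fmt"
	"log"
)

// Config stands in for the deployment configuration file; the fields and the
// required version constant are hypothetical.
type Config struct {
	RPCProtocolVersion int
	TrackRouting       map[string]string // settings the new protocol relies on
}

const minVersionNeedingRouting = 2

// validate fails fast when a protocol change ships without the configuration
// it depends on, catching the mismatch at deploy time rather than in the mesh.
func validate(cfg Config) error {
	if cfg.RPCProtocolVersion >= minVersionNeedingRouting && len(cfg.TrackRouting) == 0 {
		return fmt.Errorf("rpc protocol v%d requires track routing configuration", cfg.RPCProtocolVersion)
	}
	return nil
}

func main() {
	cfg := Config{RPCProtocolVersion: 2} // accompanying config update missing
	if err := validate(cfg); err != nil {
		log.Fatal(err)
	}
	fmt.Println("configuration ok")
}
```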