Users unable to receive media

Incident Report for LiveKit

Postmortem

This incident was caused by inadequate backwards compatibility in the RoomService code. We had deployed a change that fixed a few bugs in RoomService. The fixes relied on changes in both the RoomService API and the media instances. The change was deployed to RoomService instances immediately, but the change to media instances was deployed in canary mode in a single region.

When CreateRoom was called in that region, it caused the media node handling the request to panic.

LiveKit clients are built to handle resume and reconnection, so when the media node crashed, participants were automatically migrated to a new instance, causing a brief pause in streams in that region. The disruption should have been short and recovered automatically.

To mitigate future disruption to service, we’ll ensure that service and media changes are always backwards compatible for at least one version.
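
As an illustration only, the compatibility rule above could look something like the following sketch. It is not LiveKit's actual code: the version numbers, the `CreateRoomRequest` shape, the `max_participants` field, and `parse_create_room` are all hypothetical names chosen for the example. The idea is that a node accepts requests from a peer one version behind or ahead, fills in defaults for fields it does not receive, ignores fields it does not know, and returns an error instead of crashing when it cannot handle a request.

```python
from dataclasses import dataclass

# Hypothetical protocol versions for this sketch; not taken from LiveKit's code.
CURRENT_VERSION = 2
MIN_SUPPORTED_VERSION = CURRENT_VERSION - 1  # stay compatible one version back


@dataclass
class CreateRoomRequest:
    name: str
    version: int
    # Assumed to be introduced in version 2; older senders omit it.
    max_participants: int = 0


def parse_create_room(payload: dict) -> CreateRoomRequest:
    """Parse a request from a peer that may be one version older or newer."""
    # A sender that predates the version field is treated as the oldest supported one.
    version = int(payload.get("version", MIN_SUPPORTED_VERSION))
    if version < MIN_SUPPORTED_VERSION:
        # Fail with an error the caller can handle rather than crashing the node.
        raise ValueError(f"unsupported protocol version {version}")
    if "name" not in payload:
        raise ValueError("missing required field: name")
    return CreateRoomRequest(
        name=str(payload["name"]),
        version=version,
        # Missing newer fields fall back to a default; unknown fields are ignored.
        max_participants=int(payload.get("max_participants", 0)),
    )
```

With this shape, a request from an older sender (`{"name": "demo", "version": 1}`) and one from a newer sender carrying an extra, unrecognized field both parse successfully, so either side of a staged rollout can be upgraded first.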

Posted Nov 19, 2022 - 00:01 PST

Resolved

This incident has been resolved.

Posted Nov 16, 2022 - 12:11 PST

Update

We have identified the issue and deployed a fix. We will continue to monitor for issues.

Posted Nov 16, 2022 - 11:33 PST

Investigating

We are currently investigating the issue.

Posted Nov 16, 2022 - 11:07 PST

This incident affected: Global Real Time Communication.