This incident was caused by inadequate backwards compatibility in the RoomService code. We had deployed a change that fixed a few bugs in RoomService. The fixes relied on changes to both the RoomService API and the media instances. The change was deployed to RoomService instances immediately, but the change to media instances was deployed in canary mode in a single region.
When CreateRoom was called in that region, it caused the media node handling the request to panic.
LiveKit clients are built to handle resume and reconnection, so when the media node crashed, participants were automatically migrated to a new instance, causing a momentary pause in streams in that region. The disruption should have been short and should have recovered automatically.
To mitigate future disruptions to service, we’ll ensure that service and media changes are always backwards compatible for at least one version.
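As a rough illustration of what this kind of compatibility can look like, the sketch below has the service pick a request form based on the version a media node reports. It is not LiveKit's actual code: the version numbers, the `minNewCreateRoom` constant, and the `createRoomFormat` helper are all hypothetical names chosen for the example.

```go
package main

import "fmt"

// minNewCreateRoom is a hypothetical protocol version: the first one at which
// media nodes understand the new form of the CreateRoom request.
const minNewCreateRoom = 5

// createRoomFormat (hypothetical helper) picks the request form a media node
// can safely handle, so an older node never receives a request it cannot parse.
func createRoomFormat(nodeVersion int) string {
	if nodeVersion >= minNewCreateRoom {
		return "new"
	}
	return "legacy"
}

func main() {
	// During a staged rollout, nodes on both versions may be serving traffic.
	for _, v := range []int{4, 5} {
		fmt.Printf("media node v%d -> %s CreateRoom request\n", v, createRoomFormat(v))
	}
}
```

With a gate like this, the service side can be deployed everywhere first, and the new request form is only used once a given media node reports a version that supports it.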