Connectivity issues in US East 1
Incident Report for LiveKit
Postmortem

Incident Timeline

19:00 UTC: US East 1 briefly lost connectivity to other regions. Connectivity was restored within seconds, but the disruption triggered an edge case that caused our cross-region systems to believe the data center had become isolated.

Our system is designed to handle connection disruptions to other data centers. However, a data center takes itself offline when it is unable to connect to any other data center. This design ensures that end users always have a path to connect to other users that are connected to the edge.
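The rule described above can be sketched as follows. This is only an illustration under stated assumptions, not our production code: the function name `should_go_offline` and the region keys are hypothetical.

```python
from typing import Dict


def should_go_offline(peer_reachable: Dict[str, bool]) -> bool:
    """Return True only when no other data center is reachable.

    peer_reachable maps a (hypothetical) region name to the result of the
    most recent connectivity check against that region.
    """
    # With no peers configured there is nothing to be isolated from.
    if not peer_reachable:
        return False
    # Losing some peers is tolerated; losing all of them means isolation.
    return not any(peer_reachable.values())
```

For example, `should_go_offline({"us-west": False, "germany": True})` returns `False` because one peer is still reachable, while `should_go_offline({"us-west": False, "germany": False})` returns `True`. A rule of this shape reacts to a single snapshot, which is why a momentary loss of every link can look like isolation.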

This false positive caused US East 1 to go entirely offline. Within a minute, user traffic began to be diverted to US West.

Typically, this would have been minimally disruptive.

19:25 UTC: We discovered that some of the tracks published to US West were not relayed correctly to Germany. This was due to a bug in the relay codebase that prevented the server in Germany from correctly locating the track in US West.

We began troubleshooting the root cause to restore the relay links.

We determined that the bug had caused a circular relay loop to form, in which the server in Germany depended on another server in France for a track that it did not have.
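A loop of this kind can be detected by following each server's relay source until the chain either reaches the server that actually holds the track or revisits a server. The sketch below is a simplified illustration, not the relay codebase: the function name, the `relay_source` mapping, and the assumption that France pointed back to Germany are all hypothetical.

```python
from typing import Dict, List, Optional


def find_relay_loop(
    relay_source: Dict[str, str], origin: str, start: str
) -> Optional[List[str]]:
    """Follow relay sources from `start` toward `origin`.

    relay_source[server] is the server that `server` pulls the track from.
    Returns the looping path if the chain revisits a server before reaching
    `origin`, otherwise None.
    """
    path: List[str] = []
    seen = set()
    node = start
    while node != origin:
        if node in seen:
            # The chain came back to a server it already visited.
            return path[path.index(node):] + [node]
        seen.add(node)
        path.append(node)
        if node not in relay_source:
            # Dead end: no loop, although the track is still unreachable.
            return None
        node = relay_source[node]
    return None


# Hypothetical state: the track lives in US West, but Germany and France
# each believe the other one has it.
broken = {"germany": "france", "france": "germany"}
loop = find_relay_loop(broken, origin="us-west", start="germany")
```

Here `loop` is `["germany", "france", "germany"]`. With a healthy mapping such as `{"germany": "france", "france": "us-west"}`, the function returns `None` because the chain ends at the origin.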

20:17 UTC: A fix was implemented, restoring the missing relay connections and resolving the incident.

Remediation Steps

  1. False Positive Detection: We will address the root cause of the false positive detection of connectivity issues, refining our isolation detection mechanisms to avoid unnecessary shutdowns.
  2. Relay System Robustness: Enhancements will be made to the relay system to prevent the formation of circular relay loops, ensuring a more reliable track relay process across regions.
  3. Monitoring Sensitivity: Our monitoring system's sensitivity will be updated to detect potential relay problems more promptly, aiming for quicker identification and resolution of similar issues.
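
As an illustration of the first remediation item, one common way to avoid reacting to a momentary blip is to require that isolation persist for a grace period before shutting down. This is a sketch of that general technique only; it does not describe the specific change we will make, and the class name and the 30-second value are hypothetical.

```python
from typing import Optional


class IsolationDetector:
    """Report isolation only after it has persisted for `grace_seconds`."""

    def __init__(self, grace_seconds: float) -> None:
        self.grace_seconds = grace_seconds
        # Time of the first consecutive probe in which no peer was reachable.
        self._isolated_since: Optional[float] = None

    def observe(self, now: float, any_peer_reachable: bool) -> bool:
        """Record one probe result; return True if the node should go offline."""
        if any_peer_reachable:
            # Any successful probe resets the timer.
            self._isolated_since = None
            return False
        if self._isolated_since is None:
            self._isolated_since = now
        return now - self._isolated_since >= self.grace_seconds


detector = IsolationDetector(grace_seconds=30.0)
blip = detector.observe(now=0.0, any_peer_reachable=False)        # brief outage begins
recovered = detector.observe(now=5.0, any_peer_reachable=True)    # links return
```

In this example both `blip` and `recovered` are `False`, so a disruption that clears within seconds never takes the data center offline. The trade-off is that a genuine isolation is acted on one grace period later.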
Posted Feb 07, 2024 - 23:41 PST

Resolved
This issue has been resolved. We will provide a postmortem in the coming days.
Posted Feb 07, 2024 - 15:49 PST
Update
We are continuing to monitor for any further issues.
Posted Feb 07, 2024 - 14:15 PST
Monitoring
We have resolved the issues with the affected datacenter. The datacenter has been restored and is now serving traffic. We will continue monitoring for other issues.
Posted Feb 07, 2024 - 14:13 PST
Identified
We have identified the issues and are working on mitigation.
Posted Feb 07, 2024 - 12:10 PST
Investigating
We are currently investigating this issue in US East 1.
Posted Feb 07, 2024 - 11:08 PST
This incident affected: Regional TURN (US East 1 - TURN) and Regional Real Time Communication (US East 1 - Real Time Communication).