19:00 UTC: US East 1 briefly lost connectivity to other regions. Connectivity was restored within seconds, but the disruption triggered an edge case that caused our cross-region systems to believe the data center had become isolated.
While our system is designed to handle connection disruptions to other data centers, each data center is designed to take itself offline when it cannot connect to any other data center. The purpose of this design is to ensure end users always have a path to connect to other users that are connected to the edge.
This false positive caused US East 1 to go entirely offline. Within a minute, user traffic began to be diverted to US West.
Typically this would have been minimally disruptive.
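The isolation rule described above can be illustrated with a minimal sketch. The function name, the region names, and the idea of a single reachability snapshot are assumptions made for illustration; how the production system actually samples connectivity is not described in this report.

```python
from typing import Mapping


def should_stay_online(peer_reachable: Mapping[str, bool]) -> bool:
    """Return True if the data center should keep serving traffic.

    `peer_reachable` maps each peer data center (hypothetical names) to
    whether it is reachable right now. Losing some peers is tolerated;
    losing every peer is treated as isolation.
    """
    # Stay online as long as at least one other data center is reachable.
    return any(peer_reachable.values())


# Partial disruption: one peer is still reachable, so the region stays online.
partial = {"us-west": True, "germany": False}

# Momentary blip: every peer looks unreachable in this one snapshot.
blip = {"us-west": False, "germany": False}

print(should_stay_online(partial))
print(should_stay_online(blip))
```

Under a rule like this, a snapshot taken during a brief blip looks the same as genuine isolation, which is the kind of false positive described in the 19:00 UTC entry.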
19:25 UTC: We discovered that some of the tracks published to US West were not relayed correctly to Germany. This was due to a bug in the relay codebase that prevented the server in Germany from correctly locating the track in US West.
We began troubleshooting the root cause to restore the relay links.
We determined that the bug had caused a circular relay loop to form, in which the server in Germany depended on a server in France for a track that the server in France did not itself have.
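A circular relay dependency of this kind can be detected by following relay links until the origin is reached or a server repeats. The sketch below is illustrative only: the function name, the `upstream` mapping, and the exact topology are assumptions, not the actual relay code.

```python
from typing import Mapping, Optional


def resolve_track_source(
    server: str, upstream: Mapping[str, Optional[str]]
) -> Optional[str]:
    """Follow relay links from `server` to the server that holds the track.

    `upstream[s]` is the server that `s` pulls the track from, or None when
    `s` holds the published track itself (hypothetical data model).
    Returns the origin server, or None if the chain loops or dead-ends.
    """
    seen = set()
    current = server
    while current not in seen:
        seen.add(current)
        if current not in upstream:
            return None  # dead end: no relay information for this server
        nxt = upstream[current]
        if nxt is None:
            return current  # reached the server that holds the track
        current = nxt
    return None  # revisited a server: the relay chain is circular


# Healthy case: Germany pulls directly from US West, where the track lives.
healthy = {"germany": "us-west", "us-west": None}

# Assumed faulty case: Germany and France each point at the other, so
# neither ever reaches US West.
looped = {"germany": "france", "france": "germany", "us-west": None}

print(resolve_track_source("germany", healthy))
print(resolve_track_source("germany", looped))
```

Tracking visited servers keeps the lookup bounded, so a loop surfaces as an explicit failure instead of a relay that never resolves.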
20:17 UTC: A fix was implemented, restoring the missing relay connections and resolving the incident.