On Monday, around 4:37 PT, the relay connections between US East 1 and US East 2 were lost, leaving users connected to US East 2 unable to subscribe to tracks published to US East 1. The incident was caused by a bug in our track routing system (a component named Director). Director instances were restarted around 4:37 PT. Normally, restarted instances pick up where they left off. On Monday, however, the restart triggered an edge case that caused Director instances to believe they were unable to connect to the other regions. This condition lasted around 4 minutes before recovering automatically.
We are making two improvements to prevent similar situations from occurring:
1. We'll add safeguards in the code to prevent Director from incorrectly entering the disconnected state.
2. From a process standpoint, this could have been prevented with staggered restarts. We'll stagger restarts so that each region applies restarts/deploys only during its low-traffic hours.
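
To illustrate the kind of safeguard described in item 1, here is a minimal sketch in Python. It is not Director's actual code, and it does not model the specific edge case, which this report does not detail. Every name in it (`RegionLinkMonitor`, `startup_grace_s`, `failure_threshold`, the region label) is hypothetical. The idea is that a freshly restarted instance should not declare a region disconnected until a startup grace period has passed and several consecutive probes have failed.

```python
from dataclasses import dataclass, field


@dataclass
class RegionLinkMonitor:
    """Hypothetical guard against declaring a region disconnected too eagerly."""

    started_at: float               # time (seconds) at which this instance started
    startup_grace_s: float = 30.0   # assumed grace period after a restart
    failure_threshold: int = 3      # assumed number of consecutive failed probes
    _failures: dict = field(default_factory=dict)

    def record_probe(self, region: str, ok: bool) -> None:
        # A successful probe resets the counter; a failure increments it.
        self._failures[region] = 0 if ok else self._failures.get(region, 0) + 1

    def is_disconnected(self, region: str, now: float) -> bool:
        # Never report a disconnect while still inside the startup grace period.
        if now - self.started_at < self.startup_grace_s:
            return False
        # Otherwise require a run of consecutive failures, not a single one.
        return self._failures.get(region, 0) >= self.failure_threshold
```

With the defaults above, three failed probes recorded 10 seconds after a restart do not mark the region as disconnected. The same failures are reported once the 30-second grace period has elapsed, and a single successful probe clears the state.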