March 8th at 05:10 UTC: We received alarms that there had been a sudden spike in memory usage for our message bus in the US East 1 data center. Memory usage was three times higher than usual and continued to grow.
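For context, the alarm that fired compares memory usage against a recent baseline; it triggers once usage exceeds roughly three times that baseline. The sketch below only illustrates that condition and is not our actual monitoring code; the class name, window size, and multiplier are hypothetical.

```python
from collections import deque

# Illustrative only: a rolling-baseline alarm that fires when memory usage
# exceeds three times the recent average, roughly the condition described
# above. Names and thresholds are hypothetical, not our real tooling.
ALARM_MULTIPLIER = 3.0

class MemoryAlarm:
    def __init__(self, window: int = 60):
        self.samples = deque(maxlen=window)  # recent usage samples, in bytes

    def observe(self, usage_bytes: float) -> bool:
        """Record a sample; return True once usage exceeds 3x the baseline."""
        if len(self.samples) == self.samples.maxlen:
            baseline = sum(self.samples) / len(self.samples)
            if usage_bytes > ALARM_MULTIPLIER * baseline:
                return True  # keep the pre-spike baseline so the alarm stays latched
        self.samples.append(usage_bytes)
        return False
```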
05:12 UTC: Some media SFUs began failing their health checks.
05:14 UTC: The surge in memory usage caused our message bus instances to crash with Out-of-Memory (OOM) errors. Because all instances had similar memory usage and growth rates, every instance was OOM-killed at around the same time despite the redundant copies of the message bus service we run. This resulted in a complete failure of the message bus in US East 1.
05:15 UTC: Our failover mechanism activated, redirecting clients to the US West region.
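This report does not detail the failover mechanism itself. As a rough, hypothetical sketch, client-side failover across regional endpoints can be as simple as trying each region in priority order; the hostnames, port, and timeout below are placeholders, not our real service addresses.

```python
import socket

# Simplified sketch of client-side regional failover: try each regional
# endpoint in priority order and connect to the first one that accepts.
REGION_ENDPOINTS = [
    ("bus.us-east-1.example.com", 443),
    ("bus.us-west.example.com", 443),
]

def connect_with_failover(timeout: float = 2.0) -> socket.socket:
    last_error = None
    for host, port in REGION_ENDPOINTS:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:
            last_error = err  # region unreachable; fall through to the next one
    raise ConnectionError("all regional endpoints failed") from last_error
```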
The team continued investigating the cause of the increased memory usage.
06:02 UTC: Once traffic was redirected to US West, that region began firing the same memory alarms and exhibited a similar pattern of memory growth.
06:05 UTC: We increased the number of message bus instances in both data centers (US East 1 and US West) and tripled their memory allocation.
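The report does not specify how the message bus is orchestrated. Assuming a Kubernetes-style deployment purely for illustration, the remediation corresponds to a change along these lines; the deployment name, namespace, replica count, and memory sizes are hypothetical.

```python
from kubernetes import client, config

# Hypothetical sketch: scale out the message bus and triple its memory limit,
# assuming it runs as a Kubernetes Deployment (not confirmed by this report).
config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "replicas": 6,  # scaled out from the previous instance count
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "message-bus",
                        "resources": {
                            "requests": {"memory": "6Gi"},
                            "limits": {"memory": "6Gi"},  # 3x the prior limit
                        },
                    }
                ]
            }
        },
    }
}

apps.patch_namespaced_deployment(name="message-bus", namespace="messaging", body=patch)
```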
06:20 UTC: Service fully recovered and client connections stabilized.
Following the incident, both the US East 1 and US West data centers remained stable. We continued to investigate the root cause of the increased memory usage.
Contributing factors:
We will be taking the following steps to prevent this and similar issues from occurring:
These steps are designed to mitigate the identified issues and improve the overall resiliency and efficiency of our systems.