Service Disruption
Incident Report for LiveKit
Postmortem

Incident Timeline

March 8th at 05:10 UTC: We received alarms indicating a sudden spike in memory usage for our message bus in the US East 1 data center. Memory usage was three times higher than usual and continued to grow.

05:12 UTC: Some media SFUs began failing health checks.

05:14 UTC: The surge in memory usage caused our message bus instances to experience Out-of-Memory (OOM) crashes. All bus instances showed similar memory usage and growth rates, so although we run redundant copies of the message bus service, every instance was OOM-killed at around the same time. This resulted in a complete failure of the bus system in US East 1.

05:15 UTC: Our failover mechanism activated, redirecting clients to the US West region.

The team continued investigating the cause of the increased memory usage.

06:02 UTC: After traffic was redirected to US West, that region began firing identical memory alarms and exhibiting the same pattern of memory growth.

06:05 UTC: We increased the number of message bus instances in both data centers (US East 1 and US West) and tripled their memory allocation.

06:20 UTC: Service fully recovered and client connections stabilized.

Following the incident, both the US East 1 and US West data centers remained stable. We continued to investigate the root cause of the increased memory usage.

Root Cause Analysis

Contributing factors:

  1. Increased Traffic: Recent weeks have seen a rise in concurrent users, leading to a higher number of subscriptions to pub/sub topics within our internal systems.
  2. Large Payloads: Our message bus facilitates communication between data centers. Among other tasks, it synchronizes the list of subscriptions (pub/sub topics) across data centers. The increased traffic (1) led to substantially larger serialized subscription payloads that took a nontrivial amount of time to synchronize.
  3. Network Deterioration: Network transmission speed between the US East and Sydney data centers had deteriorated prior to the incident, further lengthening transmission times. As a result, the now-larger subscription payloads originating in US East and destined for Sydney piled up in memory, increasing memory usage.
  4. Memory Leak: Payloads that were not transmitted before the timeout were never reclaimed, leading to continuous memory growth (illustrated in the sketch following this list).
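
Taken together, factors 2 through 4 describe a common failure pattern: serialized state is buffered for cross-region delivery, the send is bounded by a timeout, and the buffered copy is only released on success. The following is a minimal Go sketch of that pattern; the names (`syncer`, `syncSubscriptions`), the 2-second timeout, and the payload sizes are hypothetical illustrations, not LiveKit's actual code or values.

```go
// Minimal sketch (hypothetical names, not LiveKit's actual code) of how a
// timed-out sync can leak: every attempt buffers the serialized subscription
// list, but only successful sends remove it from the pending map.
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

type syncer struct {
	mu      sync.Mutex
	pending map[uint64][]byte // serialized payloads awaiting acknowledgement
	nextID  uint64
}

func (s *syncer) syncSubscriptions(payload []byte, send func(context.Context, []byte) error) {
	s.mu.Lock()
	id := s.nextID
	s.nextID++
	s.pending[id] = payload // held until the remote region acknowledges
	s.mu.Unlock()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	if err := send(ctx, payload); err != nil {
		// Bug: on timeout we return without deleting pending[id], so the
		// (large) payload is never reclaimed. A slower link means more
		// timeouts, and therefore unbounded memory growth.
		return
	}

	s.mu.Lock()
	delete(s.pending, id) // only successful sends are cleaned up
	s.mu.Unlock()
}

func main() {
	s := &syncer{pending: make(map[uint64][]byte)}

	// Simulate a degraded inter-region link: the transfer takes longer than
	// the transmission timeout, so every send fails.
	slowLink := func(ctx context.Context, _ []byte) error {
		select {
		case <-time.After(5 * time.Second):
			return nil
		case <-ctx.Done():
			return errors.New("transmission timed out")
		}
	}

	for i := 0; i < 3; i++ {
		s.syncSubscriptions(make([]byte, 10<<20), slowLink) // ~10 MiB per sync attempt
	}
	fmt.Printf("payloads still pinned in memory: %d\n", len(s.pending))
}
```

In this shape, larger payloads (2) and a slower link (3) raise the share of sends that hit the timeout, so the pending buffer, and with it memory usage, grows without bound (4).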

Remediation Steps

We will be taking the following steps to prevent this and similar issues from occurring:

  1. Adjust the transmission timeout settings to reduce the likelihood of subscription transmission failures.
  2. Reduce our use of pub/sub topics by a factor of 10.
  3. Address the memory leak within the message bus synchronization process to ensure memory is properly reclaimed after failed transmission attempts (sketched below).

These steps are designed to mitigate the identified issues and improve the overall resiliency and efficiency of our systems.
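
As an illustration of step 3, and continuing with the hypothetical syncer type from the sketch in the previous section (again, not LiveKit's actual code), the essential change is to release the buffered payload on every exit path rather than only on success; the more generous timeout shown here stands in for step 1 and is not a real configuration value.

```go
// Fixed variant of the hypothetical syncer above: a deferred cleanup removes
// the buffered payload whether the send succeeds, fails, or times out.
func (s *syncer) syncSubscriptionsFixed(payload []byte, send func(context.Context, []byte) error) error {
	s.mu.Lock()
	id := s.nextID
	s.nextID++
	s.pending[id] = payload
	s.mu.Unlock()

	// Guarantee reclamation on every return path.
	defer func() {
		s.mu.Lock()
		delete(s.pending, id)
		s.mu.Unlock()
	}()

	// A longer timeout (step 1) makes transmission failures rarer; the exact
	// value here is illustrative only.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	return send(ctx, payload)
}
```

With this structure, a degraded link still delays synchronization, but a failed attempt no longer pins its payload, so memory usage stays bounded by the number of in-flight sends.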

Posted Mar 08, 2024 - 21:23 PST

Resolved
All systems have remained stable. We'll provide a full postmortem tomorrow.
Posted Mar 07, 2024 - 23:09 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 07, 2024 - 22:18 PST
Investigating
We have identified an outage and disruption affecting our US regions. We are continuing to investigate and are applying mitigations.
Posted Mar 07, 2024 - 22:12 PST
This incident affected: Global Real Time Communication and Regional Real Time Communication (US West - Real Time Communication, US East 1 - Real Time Communication).