SIP outbound call issues

Incident Report for LiveKit

Postmortem

Root Cause

A code change unintentionally overrode the user-supplied ringing_timeout for synchronous CreateSIPParticipant API calls. As a result, calls with wait_until_answered=true timed out significantly earlier than intended and failed prematurely.

Widespread impact lasted for over 2 hours because the regression evaded our automated tests and monitoring systems. Below, we share technical details of the failure in the interest of transparency.

Technical Details

The CreateSIPParticipant API, used for making outbound calls, supports two modes of operation:

  • Async mode: This is the default behavior, where the API dials the call and returns 200 immediately.
  • Sync mode: When wait_until_answered is set to true, the API holds the connection open until the user answers or declines the call.

In sync mode, the API returns a 200 status only if the user answers the call, or a 408 if they do not respond within the user-defined ringing_timeout (which defaults to 30 seconds).

To improve observability in our RPC stack, we recently introduced a change that enables tracing for internal RPC calls. With this change, traces are exported automatically to our observability platform whenever an internal RPC is invoked.

However, enabling this tracing also unintentionally introduced a default timeout of 3 seconds on internal RPCs. Consequently, two competing timeouts came into play during CreateSIPParticipant calls:

  • ringing_timeout: User-defined, defaults to 30 seconds.
  • internal RPC timeout: Fixed at 3 seconds.

Because the internal RPC timeout was the shorter of the two, internal APIs returned a 408 error before ringing_timeout was reached. As a result, SIP outbound calls with wait_until_answered=true would ring for only 3 seconds before aborting. Calls answered within 3 seconds, and all calls in async mode, proceeded without issues.
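The interaction of the two timeouts can be illustrated with a small model. This is a sketch, not LiveKit's implementation: the function name call_status and its parameters are hypothetical, and it assumes that the shorter of the two deadlines simply ends the wait first.

```python
from typing import Optional


def call_status(
    answer_delay_s: Optional[float],
    wait_until_answered: bool,
    ringing_timeout_s: float = 30.0,
    internal_rpc_timeout_s: Optional[float] = None,
) -> int:
    """Return the status a caller would see in this simplified model.

    answer_delay_s: seconds until the callee answers (None = never answers).
    internal_rpc_timeout_s: the unintended internal deadline (None = absent).
    All names here are hypothetical and used for illustration only.
    """
    if not wait_until_answered:
        # Async mode returns immediately, so neither timeout is reached.
        return 200

    # Assumption: whichever deadline is shorter ends the wait first.
    effective_timeout_s = ringing_timeout_s
    if internal_rpc_timeout_s is not None:
        effective_timeout_s = min(effective_timeout_s, internal_rpc_timeout_s)

    if answer_delay_s is not None and answer_delay_s <= effective_timeout_s:
        return 200
    return 408


# A callee who answers after 10 seconds, with the default ringing_timeout:
before_change = call_status(10.0, wait_until_answered=True)
during_incident = call_status(
    10.0, wait_until_answered=True, internal_rpc_timeout_s=3.0
)
```

In this model the same call succeeds before the change and returns 408 once the 3-second internal timeout is present, while async calls and calls answered within 3 seconds are unaffected.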

Detection and Response Challenges

Service reliability is our top priority; we maintain rigorous testing and alerting systems, including:

  • Continuous end-to-end tests running against both staging and production environments.
  • Phased deployment across our global infrastructure, starting with low-traffic regions.
  • Alarms triggered by high error rates (5xx) on customer-facing API calls.
  • Alarms for elevated internal RPC error rates.
  • Manual review of key health indicators during deployments.

Despite these measures, the issue went undetected during deployment for the following reasons:

  • The 3-second timeout caused CreateSIPParticipant to return a 408 before ringing_timeout was reached. Because 408 is a 4xx status, these failures did not count toward our alarms on 5xx error rates.
  • Initial rollout regions had very few users relying on sync mode, so calls there largely completed without disruption.
  • Our end-to-end tests simulate actual calls but use a bot on the receiving end, which answers within 3 seconds.
  • SIP health indicators showed calls being made and completing overall (though average call duration dropped during the incident, it was not monitored as a key health metric).
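The alarm blind spot can be shown with a short sketch. It assumes an alarm that fires when the share of 5xx responses exceeds a threshold; the function names, the 5% threshold, and the sample data are all made up for illustration and do not describe the actual alarm configuration.

```python
from typing import Sequence


def rate_5xx(statuses: Sequence[int]) -> float:
    """Fraction of responses whose status is in the 5xx range."""
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if 500 <= s <= 599) / len(statuses)


def rate_of(statuses: Sequence[int], code: int) -> float:
    """Fraction of responses with exactly the given status code."""
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if s == code) / len(statuses)


def alarm_fires(statuses: Sequence[int], threshold: float = 0.05) -> bool:
    """Hypothetical alarm: fires only on an elevated 5xx rate."""
    return rate_5xx(statuses) > threshold


# Made-up sample: 60 of 100 sync calls time out early with a 408.
sample = [408] * 60 + [200] * 40

fires = alarm_fires(sample)        # 408 is a 4xx, so the 5xx rate stays at 0
timeout_share = rate_of(sample, 408)
```

Even with a majority of calls failing in this sample, the 5xx-based alarm stays silent, which is why a separate signal such as the 408 rate or call duration is needed to catch this class of failure.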

Timeline

2026-01-08 04:47 UTC – Change first deployed to a limited set of low-traffic regions.

2026-01-08 15:40 UTC – Second rollout phase to regions including Asia.

2026-01-08 20:45 UTC – Third rollout phase to additional regions, including EU.

2026-01-09 07:38 UTC – Change deployed to the majority of regions, resulting in widespread impact.

2026-01-09 09:58 UTC – Change fully rolled back, resolving the issue.
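The durations implied by the timeline above can be checked with a few lines of arithmetic. The variable names are illustrative; the timestamps are taken directly from the timeline.

```python
from datetime import datetime, timedelta, timezone


def utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' timestamp as UTC."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


first_deploy = utc("2026-01-08 04:47")
majority_deploy = utc("2026-01-09 07:38")
rollback = utc("2026-01-09 09:58")

# Widespread impact: from the majority rollout until the rollback.
widespread_minutes = (rollback - majority_deploy) // timedelta(minutes=1)

# Total exposure: from the first limited deployment until the rollback.
exposure_minutes = (rollback - first_deploy) // timedelta(minutes=1)
```

The widespread window comes to 140 minutes (2 hours 20 minutes), and the change was live in at least some regions for 1,751 minutes (29 hours 11 minutes).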

Scope of Impact

During the incident window, outbound calls meeting the following conditions failed:

  • The call was made with wait_until_answered=true
  • The user did not answer within 3 seconds

Mitigations and Follow-ups

To prevent similar issues in the future, we are implementing the following:

  • A more robust design for managing internal and system-level timeouts, scheduled for rollout within the next week.
  • Updates to end-to-end testing to include scenarios with longer delays before call pickup.
  • Addition of call duration as a key health indicator in our monitoring dashboards.
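As a rough illustration of the last two items, the sketch below generates pickup delays spread across the ringing window for end-to-end tests, and flags a sharp drop in median call duration. All function names, fractions, and thresholds are hypothetical; they are assumptions for the sake of the example, not the planned implementation.

```python
from statistics import median
from typing import List, Sequence


def pickup_delay_scenarios(
    ringing_timeout_s: float = 30.0,
    fractions: Sequence[float] = (0.05, 0.25, 0.5, 0.9),
) -> List[float]:
    """Answer delays (seconds) covering early, mid, and late pickups.

    The fractions are an arbitrary illustrative choice; the point is that
    some scenarios answer well after a few seconds of ringing.
    """
    return [round(f * ringing_timeout_s, 1) for f in fractions]


def duration_dropped(
    baseline_s: Sequence[float],
    current_s: Sequence[float],
    max_drop: float = 0.5,
) -> bool:
    """True when the median call duration fell by more than max_drop.

    max_drop=0.5 (a 50% drop) is a hypothetical threshold.
    """
    if not baseline_s or not current_s:
        return False
    baseline_median = median(baseline_s)
    if baseline_median <= 0:
        return False
    return median(current_s) < baseline_median * (1 - max_drop)


delays = pickup_delay_scenarios()

# Made-up durations: most calls in the current window end after ~3 seconds.
unhealthy = duration_dropped([40.0, 50.0, 60.0], [3.0, 3.0, 45.0])
healthy = duration_dropped([40.0, 50.0, 60.0], [45.0, 50.0, 55.0])
```

A bot that answers only at the shortest delay would still pass while a short internal timeout is in effect; the later delays are the ones that would have exposed it.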

We appreciate your understanding and are committed to continuously improving our platform's reliability. If you have any questions or feedback, please reach out to our support team.

Posted Jan 10, 2026 - 11:31 PST

Resolved

This incident has been resolved. Understanding the root cause is our highest priority, and we will share a full postmortem shortly.
Posted Jan 09, 2026 - 02:24 PST

Update

We are continuing to investigate this issue. (SIP Global)
Posted Jan 09, 2026 - 02:24 PST

Update

We have mitigated the issue by rolling back a media service release that went out around the same time. The time of impact was about 07:41 - 10:01 UTC. (SIP Global)
Posted Jan 09, 2026 - 02:23 PST

Investigating

We are investigating customer reports regarding SIP outbound calls. Inbound calls are not affected. We'll provide more details soon. (SIP Global)
Posted Jan 09, 2026 - 02:00 PST
This incident affected: Regional SIP (US West - SIP, US East - SIP, US Central - SIP).