Root Cause
A code change unintentionally overrode the user-supplied ringing_timeout for synchronous CreateSIPParticipant API calls. As a result, calls with wait_until_answered=true timed out significantly earlier than intended and failed prematurely.
The widespread impact lasted over 2 hours because the regression evaded our automated tests and monitoring systems. For transparency, we share the technical details behind the failure below.
Technical Details
The CreateSIPParticipant API, used for making outbound calls, supports two modes of operation:
- Async mode: the API returns a 200 immediately.
- Sync mode: when wait_until_answered is set to true, the API holds the connection open until the user answers or declines the call.
In sync mode, the API returns a 200 status only if the user answers the call, or a 408 if they do not respond within the user-defined ringing_timeout (which defaults to 30 seconds).
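The two modes can be summarized as a small decision function. This is a minimal sketch of the intended behavior described above, not the actual server implementation: the function name expected_status and the parameter answer_delay_s are hypothetical, and the status returned when a user declines is not stated in this report, so it is left out of the model.

```python
from typing import Optional


def expected_status(
    wait_until_answered: bool,
    answer_delay_s: Optional[float],
    ringing_timeout_s: float = 30.0,
) -> int:
    """Model the intended CreateSIPParticipant status code.

    answer_delay_s is a hypothetical input: seconds until the user answers,
    or None if they never respond.
    """
    if not wait_until_answered:
        # Async mode: the API returns 200 immediately, before any answer.
        return 200
    # Sync mode: 200 only if the user answers within ringing_timeout.
    # Treating an answer exactly at the deadline as a timeout is a modeling choice.
    if answer_delay_s is not None and answer_delay_s < ringing_timeout_s:
        return 200
    return 408
```

For example, expected_status(True, 12.0) yields 200 under the default 30-second ringing_timeout, while expected_status(True, None) yields 408.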
To improve observability in our RPC stack, we recently introduced a change that enables tracing for internal RPC calls. With this change, traces are exported automatically to our observability platform whenever an internal RPC is invoked.
However, enabling this tracing also unintentionally introduced a default timeout of 3 seconds on internal RPCs. Consequently, two competing timeouts came into play during CreateSIPParticipant calls:
- ringing_timeout: user-defined, defaults to 30 seconds.
- Internal RPC timeout: 3 seconds by default, introduced unintentionally by the tracing change.
These inconsistent timeouts caused internal APIs to return a 408 error before the ringing_timeout was reached. As a result, SIP outbound calls with wait_until_answered=true would ring for only 3 seconds before aborting. Calls answered within 3 seconds, and calls in async mode, proceeded without issues.
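When two timeouts apply to the same operation, the shorter one decides the outcome. The sketch below models that interaction for sync-mode calls. It is an illustration under assumptions, not the production code: sync_call_status, answer_delay_s, and internal_rpc_timeout_s are hypothetical names, and passing None for the internal timeout stands in for the behavior before the tracing change.

```python
from typing import Optional

DEFAULT_RINGING_TIMEOUT_S = 30.0  # default ringing_timeout stated above
INTERNAL_RPC_TIMEOUT_S = 3.0      # default introduced by the tracing change


def sync_call_status(
    answer_delay_s: Optional[float],
    ringing_timeout_s: float = DEFAULT_RINGING_TIMEOUT_S,
    internal_rpc_timeout_s: Optional[float] = INTERNAL_RPC_TIMEOUT_S,
) -> int:
    """Return the status of a sync-mode call when both timeouts are in play."""
    # Whichever deadline expires first ends the call.
    if internal_rpc_timeout_s is None:
        effective_timeout_s = ringing_timeout_s
    else:
        effective_timeout_s = min(ringing_timeout_s, internal_rpc_timeout_s)

    # answer_delay_s is None when the user never responds.
    if answer_delay_s is not None and answer_delay_s < effective_timeout_s:
        return 200
    return 408
```

With the 3-second internal timeout active, a call answered after 10 seconds yields 408 even though ringing_timeout is 30 seconds. With internal_rpc_timeout_s=None, the same call yields 200.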
Detection and Response Challenges
Service reliability is our top priority; we maintain rigorous testing and alerting systems, including:
Despite these measures, the issue went undetected during deployment for the following reasons:
CreateSIPParticipant to return a 408 before it has reached ringing_timeout.
Timeline
2026-01-08 04:47 UTC – Change first deployed to a limited set of low-traffic regions.
2026-01-08 15:40 UTC – Second rollout phase to regions including Asia.
2026-01-08 20:45 UTC – Third rollout phase to additional regions, including EU.
2026-01-09 07:38 UTC – Change deployed to the majority of regions, resulting in widespread impact.
2026-01-09 09:58 UTC – Change fully rolled back, resolving the issue.
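The durations implied by this timeline can be checked with a few lines of date arithmetic. The variable names are illustrative, and total_exposure measures only the time between the first limited deployment and the rollback, which is not necessarily the time any given customer was affected.

```python
from datetime import datetime, timezone


def utc(ts: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' timestamp from the timeline as UTC."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


first_deploy = utc("2026-01-08 04:47")
majority_deploy = utc("2026-01-09 07:38")
rollback = utc("2026-01-09 09:58")

# Window between the majority-region deployment and the full rollback.
widespread_impact = rollback - majority_deploy
# Window between the first limited deployment and the full rollback.
total_exposure = rollback - first_deploy

print(widespread_impact)  # 2:20:00
print(total_exposure)     # 1 day, 5:11:00
```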
Scope of Impact
During the incident window, outbound calls meeting the following conditions failed:
- wait_until_answered=true
Mitigations and Follow-ups
To prevent similar issues in the future, we are implementing the following:
We appreciate your understanding and are committed to continuously improving our platform's reliability. If you have any questions or feedback, please reach out to our support team.