Missing Webhooks for events

Incident Report for LiveKit

Postmortem

Timeline

  • 10:00 UTC: In US East 1, roughly 50% of our servers stopped sending webhook events
  • 12:39 UTC: A customer reported that webhooks were not being delivered
  • 12:52 UTC: Our team began investigating
  • 13:35 UTC: We identified that specific servers were failing to deliver webhooks and started to cycle them out
  • 15:10 UTC: All affected instances had been removed and webhook delivery was back to normal

RCA

Our servers rely on an internal gRPC-based service, which we will call service X, to fetch the webhook destinations for each event we attempt to deliver. Each server caches webhook delivery information and refreshes it periodically. The cache lets us tolerate short periods of upstream unavailability, such as temporary network disruptions.

On May 29, an internal disruption in the Kubernetes stack made service X momentarily unavailable (likely for less than 20 seconds). Normally, the cache layer would step in and supply the webhook destinations. This particular blip, however, left the gRPC clients stuck mid-call: they neither timed out nor returned an error. On further research, we found that by default grpc-go waits indefinitely for a response in certain scenarios unless the caller sets a deadline.

As a result of this default and the particular way the disruption occurred, all webhook delivery on impacted servers remained stuck until those servers were rebooted.

Once we identified the issue, the team wrote queries to find the affected servers in the fleet. As each one was identified, we rebooted it, continuing until the issue was fully resolved.

Remediation

After the incident, we completed the following steps to prevent similar incidents in the future:

  • Added monitors that detect this condition and alert the on-call engineer
  • Audited gRPC usage across our internal services and added timeouts where appropriate
Posted May 30, 2025 - 18:48 PDT

Resolved

This incident has been resolved.
Posted May 29, 2025 - 11:24 PDT

Monitoring

We have received reports that webhooks are missing for some events. Our investigation shows that this behavior started at 10:00 UTC. We have identified certain components in our system that were stuck in an incorrect state and have terminated them. We are continuing to monitor for more components stuck in this state and to investigate the issue.
Posted May 29, 2025 - 07:50 PDT