Our servers rely on an internal service to fetch webhook destinations for each event we attempt to deliver (we will call this service X). This service is based on gRPC. Webhook delivery information is cached on each server for a period of time and refreshed periodically. The cache ensures that we can tolerate short periods of upstream unavailability, for example temporary network disruptions.
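To illustrate the intended behavior, here is a minimal sketch of a per-server cache with a stale-on-error fallback. The type, field names, and TTL handling are illustrative assumptions, not the real implementation:

```go
package webhookdest

import (
	"sync"
	"time"
)

// cachedDestinations sketches the per-server cache described above: entries
// are refreshed periodically, and a stale entry is still served if the
// upstream lookup fails, so a short outage of service X does not block delivery.
type cachedDestinations struct {
	mu      sync.RWMutex
	entries map[string]cacheEntry
	ttl     time.Duration
	fetch   func(eventType string) ([]string, error) // calls into service X
}

type cacheEntry struct {
	destinations []string
	fetchedAt    time.Time
}

func newCachedDestinations(ttl time.Duration, fetch func(string) ([]string, error)) *cachedDestinations {
	return &cachedDestinations{entries: make(map[string]cacheEntry), ttl: ttl, fetch: fetch}
}

func (c *cachedDestinations) Get(eventType string) ([]string, error) {
	c.mu.RLock()
	e, ok := c.entries[eventType]
	c.mu.RUnlock()

	// Fresh entry: serve it directly.
	if ok && time.Since(e.fetchedAt) < c.ttl {
		return e.destinations, nil
	}

	// Stale or missing: try to refresh from service X.
	dests, err := c.fetch(eventType)
	if err != nil {
		if ok {
			// Upstream unavailable: fall back to the stale entry.
			return e.destinations, nil
		}
		return nil, err
	}

	c.mu.Lock()
	c.entries[eventType] = cacheEntry{destinations: dests, fetchedAt: time.Now()}
	c.mu.Unlock()
	return dests, nil
}
```

Note that this fallback only helps if the call into service X actually returns an error, which is exactly what did not happen during the incident.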
On 5/29, there was an internal disruption in our Kubernetes stack. This caused service X to be unavailable momentarily (likely <20s). Normally, the cache layer would step in and serve the cached delivery information. However, this particular blip caused the gRPC clients to become stuck while making the call: they neither timed out nor returned an error. Upon further investigation, we discovered that the default behavior of go-grpc is to wait indefinitely for a response in certain scenarios, without timing out.
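As an illustration of the failure mode, the sketch below shows the kind of per-call deadline that turns a hung call into a `DeadlineExceeded` error the caller can handle. The client interface and method names are placeholders for the internal generated client, and the timeout value is only an example:

```go
package webhookdest

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// serviceXClient is a placeholder for the generated service-X client;
// the real interface and RPC names are internal.
type serviceXClient interface {
	GetDestinations(ctx context.Context, eventType string, opts ...grpc.CallOption) ([]string, error)
}

// getDestinations bounds every RPC with an explicit deadline. Without a
// deadline on the context, the call can block indefinitely in some failure
// modes; with one, a stuck call returns an error instead of hanging the
// delivery worker, and the cache layer can take over.
func getDestinations(client serviceXClient, eventType string) ([]string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second) // example value
	defer cancel()
	return client.GetDestinations(ctx, eventType)
}
```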
Because of this default behavior of go-grpc, and the particular way the disruption occurred, all webhook delivery on the impacted servers became stuck until those servers were rebooted.
Once we identified the issue, the team wrote queries to find the problematic servers in the fleet. As servers were identified, we rebooted each instance until the issue was fully resolved.
After the incident, we completed the following steps to prevent similar incidents in the future: