This incident was reported directly by customers rather than detected by our internal monitoring systems. Because our monitoring did not identify the failure, it never triggered automated alerting or incident response, which points to a gap in our observability. We sincerely apologize to affected customers; we take this detection failure seriously and are committed to doing better.
The root cause was degradation in two of our Network Load Balancers (NLBs) in US East, which caused a portion of incoming HTTPS connections to hang before reaching our backend servers.
Most of our client SDKs and applications (Web, mobile, Go SDK, etc.) have built-in timeouts and retries to survive failure modes like this. For these clients, a hanging initial connection would typically time out quickly and succeed on a subsequent retry, effectively masking the underlying problem for the majority of users.
The Rust SDK (and other SDKs built on the Rust core) was affected much more severely. While it implements retries, it did not enforce a timeout on the initial connection attempt, so affected connections could hang for much longer, causing noticeable stalls and a degraded experience for Rust-based clients.
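For readers unfamiliar with the pattern, the sketch below shows roughly what a connect timeout combined with bounded retries looks like in Rust, using a reqwest-based blocking client. The endpoint, timeout values, and retry count are illustrative and are not taken from our SDKs.

```rust
use std::{thread, time::Duration};

// Requires the reqwest crate with the "blocking" feature enabled.
use reqwest::blocking::Client;

/// Hypothetical endpoint; not an actual service URL.
const ENDPOINT: &str = "https://api.example.com/health";

fn fetch_with_retries() -> Result<String, reqwest::Error> {
    // The connect timeout bounds how long the initial TCP/TLS handshake may
    // hang; the overall timeout bounds the whole request. Without a connect
    // timeout, a hanging connection can stall for a very long time.
    let client = Client::builder()
        .connect_timeout(Duration::from_secs(3))
        .timeout(Duration::from_secs(10))
        .build()?;

    let max_attempts = 3;
    let mut last_err = None;

    for attempt in 1..=max_attempts {
        match client.get(ENDPOINT).send() {
            Ok(resp) => return resp.text(),
            Err(err) => {
                eprintln!("attempt {attempt} failed: {err}");
                last_err = Some(err);
                // Simple backoff between attempts; real SDKs typically use
                // exponential backoff with jitter.
                thread::sleep(Duration::from_millis(200 * attempt as u64));
            }
        }
    }

    Err(last_err.expect("at least one attempt was made"))
}

fn main() {
    match fetch_with_retries() {
        Ok(body) => println!("ok: {} bytes", body.len()),
        Err(err) => eprintln!("all attempts failed: {err}"),
    }
}
```

With a connect timeout in place, a hanging connection fails fast and the retry has a good chance of landing on a healthy load balancer, which is exactly the behavior that masked this incident for most clients.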
The primary reason this incident evaded detection was that our end-to-end monitoring relied on probes built with the JavaScript and Go SDKs, both of which gracefully handled the hanging connections via timeouts and retries. This created a blind spot for this specific failure mode.
(all times in UTC, 2026-02-14)
17:00 - Received a customer report that a portion of connections were failing
17:10 - Began investigating the report
17:20 - Confirmed that our connection tests were passing and that error/warning rates did not look elevated
17:25 - Team concluded (prematurely) that no widespread issue existed
20:45 - Direct testing by IP confirmed that two NLBs in US East were hanging on ~8% of requests (see the sketch after this timeline)
20:50 - On-call engineer paged to troubleshoot the load balancers
22:00 - Traffic fully diverted from the degraded load balancers; service recovered
00:00 (2026-02-15) - Created new load balancers to replace the faulty ones
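The per-IP test at 20:45 can be reproduced with something like the sketch below, which pins the service hostname to each load balancer address so that every request hits a specific NLB, and uses a connect timeout so a hanging connection is converted into a countable error. The hostname, addresses, and sample size are hypothetical, and this assumes a reqwest-style client rather than any of our SDKs.

```rust
use std::{net::SocketAddr, time::Duration};

// Requires the reqwest crate with the "blocking" feature enabled.
use reqwest::blocking::Client;

// Hypothetical values; substitute the real hostname and NLB addresses.
const HOSTNAME: &str = "api.example.com";
const NLB_ADDRS: &[&str] = &["192.0.2.10:443", "192.0.2.11:443"];
const ATTEMPTS: u32 = 100;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    for addr in NLB_ADDRS {
        let socket: SocketAddr = addr.parse()?;

        // `resolve` overrides DNS for HOSTNAME, so every request in this
        // loop goes to exactly one NLB. A short connect timeout turns a
        // hanging connection into an error instead of a stall.
        let client = Client::builder()
            .resolve(HOSTNAME, socket)
            .connect_timeout(Duration::from_secs(2))
            .timeout(Duration::from_secs(5))
            .build()?;

        let mut failures = 0;
        for _ in 0..ATTEMPTS {
            if client
                .get(format!("https://{HOSTNAME}/health"))
                .send()
                .is_err()
            {
                failures += 1;
            }
        }

        println!(
            "{addr}: {failures}/{ATTEMPTS} requests failed ({:.1}%)",
            100.0 * failures as f64 / ATTEMPTS as f64
        );
    }
    Ok(())
}
```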
The following improvements have been implemented or initiated to reduce the likelihood and impact of similar incidents:
We will continue to expand failure-mode-specific monitoring (beyond SDK-based probes) and periodically validate that our alerting covers realistic client behaviors across all major SDKs.
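To make that concrete, here is a minimal sketch of what a failure-mode-specific probe might look like: a single-shot TCP connect check with a strict timeout and deliberately no retries, run against every address behind the endpoint so a partially degraded load balancer surfaces as a failure instead of being masked. The endpoint and timeout are illustrative; a production probe would likely also complete a TLS handshake or a full request and feed its results into alerting.

```rust
use std::{
    net::{TcpStream, ToSocketAddrs},
    process::ExitCode,
    time::Duration,
};

// Hypothetical endpoint; a monitoring scheduler would run this probe
// periodically and alert on the failure rate it observes over time.
const ENDPOINT: &str = "api.example.com:443";
const CONNECT_TIMEOUT: Duration = Duration::from_secs(2);

fn main() -> ExitCode {
    let addrs: Vec<_> = match ENDPOINT.to_socket_addrs() {
        Ok(addrs) => addrs.collect(),
        Err(err) => {
            eprintln!("probe: DNS resolution failed: {err}");
            return ExitCode::FAILURE;
        }
    };

    if addrs.is_empty() {
        eprintln!("probe: DNS returned no addresses");
        return ExitCode::FAILURE;
    }

    // Probe every resolved address so a single degraded load balancer
    // cannot hide behind healthy ones. Deliberately no retry: a hanging
    // connect must surface as a failure rather than be masked.
    let mut failed = false;
    for addr in &addrs {
        match TcpStream::connect_timeout(addr, CONNECT_TIMEOUT) {
            Ok(_) => println!("probe: {addr} ok"),
            Err(err) => {
                eprintln!("probe: {addr} failed: {err}");
                failed = true;
            }
        }
    }

    if failed {
        ExitCode::FAILURE
    } else {
        ExitCode::SUCCESS
    }
}
```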
Thank you for your patience and understanding. We appreciate any additional feedback from customers who were affected.