Incident
On May 6, around 04:00 UTC, we began receiving internal alarms about cluster availability from a single cluster in our us-east hosted agents region. Upon investigation we found that etcd was unresponsive for this cluster. Without etcd, the control plane was unavailable and new actions in the cluster could not be processed. Anything already running before the start of the incident continued to run, but builds, autoscaling, and other similar events sent to this cluster failed.
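For illustration, here is a minimal sketch of the kind of probe that detects this failure mode. The apiserver address and token path are placeholder assumptions, not our actual configuration; the relevant point is that kube-apiserver exposes named health checks, including one for etcd, so a probe can distinguish "etcd is unresponsive" from a broader API server failure:

```python
# Minimal health-probe sketch. The apiserver address and token path are
# hypothetical; kube-apiserver's /livez/etcd and /readyz endpoints are real.
import ssl
import urllib.request

APISERVER = "https://10.0.0.1:6443"  # hypothetical apiserver address
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"

def check(path: str, timeout: float = 5.0) -> bool:
    """Return True if the given apiserver health check responds 200."""
    with open(TOKEN_PATH) as f:
        token = f.read().strip()
    req = urllib.request.Request(
        APISERVER + path,
        headers={"Authorization": f"Bearer {token}"},
    )
    # For brevity this sketch skips certificate verification; a real probe
    # should trust the cluster CA bundle instead.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
            return resp.status == 200
    except Exception:
        return False

if __name__ == "__main__":
    etcd_ok = check("/livez/etcd")  # etcd-specific health check
    api_ok = check("/readyz")       # overall apiserver readiness
    print(f"etcd: {'ok' if etcd_ok else 'FAILING'}, "
          f"apiserver: {'ok' if api_ok else 'FAILING'}")
```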
Around 04:30 UTC we escalated the issue to our cloud provider. Resolution took longer than expected for several reasons, including the additional complexity of getting etcd and the Kubernetes control plane back into a good state. We have multiple clusters in the region, but we did not want to strain them with the affected cluster's complete workload. To ensure stability in the other clusters, we decided to add capacity to our fleet: around 05:00 UTC we began setting up additional clusters and making plans to migrate new deployments to them. Around 08:00 UTC a scale-up operation began to revive the affected cluster, and around 09:15 UTC its control plane and etcd resources began to recover. During recovery some existing workloads became unstable, but our reconciler resolved the issue shortly after. By around 09:40 UTC everything had recovered.
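As a rough illustration of the mitigation above (the cluster names, capacities, and health hook are hypothetical, not our actual scheduler), steering new deployments away from an unhealthy cluster amounts to filtering the candidate list on health before placement:

```python
# Hypothetical sketch of routing new deployments away from an unhealthy
# cluster while spreading load, so the remaining clusters aren't overwhelmed.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Cluster:
    name: str
    capacity: int   # schedulable slots remaining
    healthy: bool   # result of a control-plane health probe

def pick_cluster(clusters: list[Cluster]) -> Optional[Cluster]:
    """Place new work on the healthy cluster with the most headroom."""
    candidates = [c for c in clusters if c.healthy and c.capacity > 0]
    if not candidates:
        return None  # no safe placement; hold the deployment instead
    return max(candidates, key=lambda c: c.capacity)

clusters = [
    Cluster("us-east-a", capacity=40, healthy=False),  # the affected cluster
    Cluster("us-east-b", capacity=12, healthy=True),
    Cluster("us-east-c", capacity=25, healthy=True),   # newly added capacity
]
target = pick_cluster(clusters)
print(target.name if target else "no healthy capacity; holding deployment")
```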
Post-incident
We've identified better methods of escalation and recovery to ensure a similar delay doesn't happen again in cases like this. We've also been tuning our workloads so that a similar incident doesn't recur, and we'll be adding monitoring and alarms for several specific scenarios uncovered in our investigation. In addition, we're continuing our work to stand up additional compute resources so that an individual cluster failure won't block deployments, scaling actions, and similar operations from completing. We've already been working on a few initiatives along these lines, and we're raising their priority to ensure they're completed soon.
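As one example of the kind of alarm we mean (the threshold, interval, and probe here are placeholder assumptions), requiring several consecutive failed control-plane probes before paging catches a sustained outage like this one without flapping on a single dropped request:

```python
# Sketch of a consecutive-failure alarm for control-plane availability.
# Threshold, interval, and the probe itself are illustrative placeholders.
import time

FAILURE_THRESHOLD = 3   # consecutive failed probes before paging
PROBE_INTERVAL_S = 30   # seconds between probes

def page_oncall(message: str) -> None:
    """Placeholder for the real paging integration."""
    print(f"PAGE: {message}")

def alarm_loop(probe, interval_s: float = PROBE_INTERVAL_S) -> None:
    """Page once when `probe` fails FAILURE_THRESHOLD times in a row."""
    failures = 0
    while True:
        failures = 0 if probe() else failures + 1
        if failures == FAILURE_THRESHOLD:
            page_oncall(f"control plane failing for {failures} consecutive probes")
        time.sleep(interval_s)

# Demo with a fake probe: two transient failures (no page), then a
# sustained outage that trips the alarm on the third consecutive failure.
results = iter([True, False, True, False, False, False])
try:
    alarm_loop(lambda: next(results), interval_s=0)
except StopIteration:
    pass
```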