On Nov 8th, 2023, an incident impacted a small percentage of runs because cluster resources were temporarily unavailable.
As soon as on-call teams were alerted, we updated our status page to let customers know there was an ongoing incident and began taking measures to restore full run availability.
The root cause was a failure to reserve cluster resources (memory and CPU), which left new processes unable to perform their expected actions.
Stale reserved resources
To provide a high quality of service within our service level guarantees, we use reserved resources (memory and CPU) that are provisioned ahead of a requested run. When our systems receive a request to execute a run via scheduling, the API, or CI, a reserved resource is assigned the task to execute. A reserved resource can be shut down due to maintenance, internal code updates, or run cancellations. When a resource shuts down, it safely terminates its connections to our internal systems and closes its processes.
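As a rough illustration of this lifecycle, the sketch below models a reserved resource being assigned a run and later shutting down gracefully. It is a minimal sketch, not our production code; the class, field, and method names are hypothetical.

```python
import enum


class State(enum.Enum):
    IDLE = "idle"            # provisioned ahead of time, waiting for a run
    RUNNING = "running"      # executing a customer run
    TERMINATED = "terminated"


class ReservedResource:
    """Hypothetical model of a pre-provisioned unit of memory and CPU."""

    def __init__(self, resource_id: str, memory_mb: int, cpu_millis: int):
        self.resource_id = resource_id
        self.memory_mb = memory_mb
        self.cpu_millis = cpu_millis
        self.state = State.IDLE

    def assign(self, run_id: str) -> None:
        # A run arriving via scheduling, the API, or CI is handed to a warm,
        # already-reserved resource instead of waiting for provisioning.
        assert self.state is State.IDLE, "only idle resources take new runs"
        self.state = State.RUNNING

    def shutdown(self) -> None:
        # Triggered by maintenance, internal code updates, or run cancellation.
        # Connections to internal systems are closed first; only after the
        # process exits cleanly is the memory/CPU reservation handed back.
        self.state = State.TERMINATED
```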
On Nov 1st, we upgraded a third-party monitoring library which hooks into process-level information to provide us the ability to monitor the quality of service we are providing to our customers.
Starting Nov 5th at 6:30 AM UTC, some reserved resources were not being cleaned up and were holding on to cluster memory and CPU without utilizing them. To service new requests, additional resources had to be reserved and allocated. We also have a built-in overflow system that lets requests that cannot be serviced by reserved resources fall back to on-demand resources; these on-demand resources can only be assigned if capacity is available in the cluster.
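The sketch below illustrates that overflow decision: prefer a warm reserved resource, otherwise fall back to on-demand capacity, which itself depends on free memory in the cluster. The names and numbers are illustrative assumptions, not our actual scheduler.

```python
from typing import List, Optional


def dispatch_run(run_id: str,
                 idle_reserved: List[str],
                 cluster_free_mb: int,
                 run_cost_mb: int = 512) -> Optional[str]:
    """Illustrative overflow logic (resource ids and costs are hypothetical)."""
    # 1. Use a reserved resource that was provisioned ahead of the run.
    if idle_reserved:
        return idle_reserved.pop()

    # 2. Overflow path: provision on demand, but only if memory/CPU are
    #    actually free. Stale reserved resources that never release their
    #    reservation shrink this headroom, so both paths degrade together.
    if cluster_free_mb >= run_cost_mb:
        return f"on-demand-for-{run_id}"

    # 3. No reserved or on-demand capacity: the run cannot be placed yet.
    return None
```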
Cluster resource health
Starting Nov 7th, with stale reserved resources claiming more memory and CPU, our cluster health started deteriorating. Our operations teams began provisioning more resources to the cluster to return it to a healthy state while continuing to investigate the root cause. We noticed occurrences of a race condition that forced reserved resources to hang during their exit actions. We suspected the third-party library upgrade introduced a change that left processes stuck in an unknown state; over time, these stuck processes claimed a growing share of the available reserved resources.
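To show the general failure pattern (this is a generic reproduction, not the monitoring library's code), the sketch below demonstrates how an exit hook that races a stuck background thread can leave a process hanging in its exit actions, so its reservation is never released:

```python
import atexit
import threading
import time

# Hypothetical reproduction of the pattern: an exit hook waits on a lock held
# by a background worker that is itself blocked, so exit actions never finish
# and the process keeps holding its reserved memory and CPU.

flush_lock = threading.Lock()


def metrics_worker() -> None:
    # The worker grabs the lock and then blocks indefinitely (standing in for
    # a call that never returns once shutdown has begun).
    with flush_lock:
        time.sleep(3600)


def flush_on_exit() -> None:
    # The exit hook needs the same lock; in the real failure it would block
    # forever. Here we use a timeout so the example terminates.
    if not flush_lock.acquire(timeout=5):
        print("exit hook stuck: the reservation would never be released")


atexit.register(flush_on_exit)
threading.Thread(target=metrics_worker, daemon=True).start()
time.sleep(0.1)  # let the worker take the lock before the process exits
```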
Between Nov 8th 7:28 AM UTC and 9:11 AM UTC, we learned that the reaper scripts were not cleaning up resources fast enough and the cluster was running out of memory. To get the cluster back into a healthy state, we over-provisioned it with 3x resources. Our investigation narrowed down the cause, and we began rolling back the third-party monitoring library change. This resolved the issue.
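For illustration only, the sketch below shows the kind of bookkeeping a reaper performs; the names, threshold, and data model are hypothetical. A resource that has stopped reporting for longer than a threshold is forcibly removed and its reservation returned to the cluster.

```python
import time
from dataclasses import dataclass
from typing import List, Optional

STALE_AFTER_SECONDS = 300  # illustrative threshold, not our production value


@dataclass
class TrackedResource:
    resource_id: str
    memory_mb: int
    last_heartbeat: float  # unix timestamp of the last sign of life


def reap_stale(resources: List[TrackedResource],
               now: Optional[float] = None) -> int:
    """Force-release reservations whose processes have stopped reporting.

    Returns the memory (MB) handed back to the cluster. A real reaper would
    also kill the stuck process and detach it from internal systems; this
    sketch only models the accounting.
    """
    now = time.time() if now is None else now
    reclaimed_mb = 0
    for resource in list(resources):
        if now - resource.last_heartbeat > STALE_AFTER_SECONDS:
            resources.remove(resource)
            reclaimed_mb += resource.memory_mb
    return reclaimed_mb
```

In a model like this, recovery is bounded by how quickly the reaper reclaims capacity relative to how quickly stale resources accumulate, which is why over-provisioning the cluster was needed as a stopgap while the root cause was rolled back.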
Immediate monitoring and mitigation - ETA: End of Dec’23