Multiple reports of job runs timing out
Incident Report for dbt Cloud
Postmortem

Summary

On Nov 8th, 2023, an issue impacted a small percentage of runs because cluster resources were temporarily unavailable.

  • From approximately 6:30 AM UTC to 7:28 AM UTC, <0.01% of our runs were canceled.
  • From 7:28 AM UTC to 9:11 AM UTC, a small percentage of runs (<1%) could not obtain resources to start executing and were automatically canceled. This percentage was high enough to alert our on-call teams, who immediately took action.

As soon as on-call teams were alerted, we updated our status page to let customers know that there was an ongoing incident and began taking measures to restore full run availability.

Impact

  • During the outage, <1% of Multi-tenant US deployment runs were impacted and were automatically canceled after 30 minutes.
  • Canceled runs remained visible to users, and re-running a canceled run succeeded in all cases, though this required manual user intervention.

Root Cause

The root cause of the issue was a failure to reserve cluster resources (memory and CPU), which left new processes unable to perform their expected actions.

Stale reserved resources

To provide a high quality of service within our service level guarantees, we use reserved resources (memory and CPU) that are provisioned ahead of a requested run. When our systems receive a request to execute a run via scheduling, the API, or CI, a reserved resource is assigned the task to execute. A reserved resource can be shut down due to maintenance, internal code updates, or run cancellations; when it shuts down, it safely terminates its connections with our internal systems and closes down its processes.
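
In rough terms, the intended lifecycle looks like the minimal Python sketch below. The ReservedWorker class, its fields, and the signal-based shutdown are illustrative assumptions, not dbt Cloud's actual internals.

    import signal
    import sys

    class ReservedWorker:
        """A pre-provisioned worker holding reserved memory and CPU."""

        def __init__(self):
            self.connections = []  # open connections to internal systems
            # Maintenance, code updates, and run cancellations arrive as a
            # termination signal; the worker is expected to exit cleanly.
            signal.signal(signal.SIGTERM, self._graceful_exit)

        def _graceful_exit(self, signum, frame):
            for conn in self.connections:
                conn.close()   # terminate connections to internal systems
            sys.exit(0)        # exiting releases the reserved memory and CPU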

On Nov 1st, we upgraded a third-party monitoring library that hooks into process-level information so we can monitor the quality of service we provide to our customers.

Starting Nov 5th at 6:30 AM UTC, some reserved resources were not being cleaned up and were holding on to cluster resources without utilizing them. To service new requests, additional resources had to be reserved and allocated. We have a built-in overflow system that lets requests that cannot be serviced by reserved resources fall back to on-demand resources; these on-demand resources can only be assigned if spare capacity is available in the cluster.
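
Conceptually, the dispatch path is something like the sketch below. The reserved_pool and cluster APIs are hypothetical placeholders used only to illustrate the overflow behavior.

    def dispatch(run_request, reserved_pool, cluster):
        # Prefer a pre-provisioned reserved worker; fall back to on-demand
        # capacity only while the cluster still has spare memory and CPU.
        worker = reserved_pool.acquire()
        if worker is None:
            if not cluster.has_capacity(run_request.cpu, run_request.memory):
                return None  # no capacity anywhere; the run cannot start
            worker = cluster.provision_on_demand(run_request)
        worker.execute(run_request)
        return worker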

Cluster resource health

Starting Nov 7th, with stale reserved resources claiming more memory and CPU, our cluster health started deteriorating. Our operations teams began provisioning more resources to return the cluster to a healthy state and continued investigating the root cause. We noticed occurrences of a race condition that forced reserved resources to hang during their exit actions. We suspected that the third-party library upgrade introduced a change that left processes stuck in an unknown state; these stuck processes slowly claimed more of the available reserved resources over time.
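
The hazard here is an exit hook that can block forever during process teardown. The sketch below shows the general shape of the problem and one generic way to bound such a hook with a deadline; this is a common pattern, not the specific fix we applied, and all names and the timeout value are assumptions.

    import atexit
    import threading

    def _flush_monitoring_data(done: threading.Event):
        # Stand-in for a library exit hook that can block indefinitely if
        # it races with interpreter shutdown.
        done.wait()

    def _bounded_exit_hook():
        # Run the potentially blocking teardown in a daemon thread with a
        # deadline, so a wedged hook cannot keep the process (and its
        # reserved memory and CPU) alive indefinitely.
        done = threading.Event()
        t = threading.Thread(target=_flush_monitoring_data, args=(done,), daemon=True)
        t.start()
        t.join(timeout=5)  # give up after 5 seconds and let the process exit

    atexit.register(_bounded_exit_hook)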

Between Nov 8th 7:28 AM UTC and 9:11 AM UTC, we learned that the reaper scripts were not cleaning up resources fast enough and that our cluster was running out of memory. To get the cluster back into a healthy state, we over-provisioned it with 3x resources. Our investigation narrowed down the cause, and we began rolling back the third-party monitoring library change. This resolved the issue.

Next Steps or Lessons Learned

Remediation steps we have already taken

  • We reverted the third-party monitoring library version upgrade.
  • We over-provisioned the cluster to provide more headroom for unforeseen circumstances.
  • We introduced reaper scripts that can forcibly and quickly remove any stale reserved resources on demand (a minimal sketch follows this list).
  • We improved our monitoring of overall cluster health with alerts that fire when memory or CPU usage trends upward.
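
The core of a reaper is straightforward: find reserved workers that are holding resources without executing a run and force-delete them. The sketch below is illustrative only; the cluster_client API, worker fields, and the staleness threshold are assumptions rather than our actual implementation.

    import time

    STALE_AFTER_SECONDS = 15 * 60  # assumed threshold for illustration

    def reap_stale_workers(cluster_client):
        # Force-remove reserved workers that hold memory/CPU but have not
        # executed a run for longer than the threshold.
        now = time.time()
        for worker in cluster_client.list_reserved_workers():
            idle_for = now - worker.last_active_at
            if worker.is_idle and idle_for > STALE_AFTER_SECONDS:
                # Do not wait for a graceful exit that a stuck exit hook
                # may never complete; delete the worker and free its
                # reservation immediately.
                cluster_client.force_delete(worker.id)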

Planned Remediation

Immediate monitoring and mitigation - ETA: End of Dec’23

  • We are communicating with our third-party monitoring service on how we can safely upgrade their library.
  • We will add monitors for automated cancellation trends.
  • We will build a kill switch so reserved resources can be forced to shut down in an emergency (sketched below).
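
One possible shape for such a kill switch is a flag that tells a worker's signal handler to skip graceful teardown entirely and exit immediately, returning its reserved memory and CPU to the cluster. The FORCE_SHUTDOWN flag and handler below are a hypothetical sketch, not a committed design.

    import os
    import signal

    def graceful_exit():
        # Placeholder for the normal teardown path: close connections,
        # flush telemetry, then exit.
        raise SystemExit(0)

    def _on_sigterm(signum, frame):
        # Emergency kill switch: skip the graceful teardown that a wedged
        # exit hook could block, and exit immediately so the reservation
        # is released right away.
        if os.environ.get("FORCE_SHUTDOWN") == "1":
            os._exit(1)
        graceful_exit()

    signal.signal(signal.SIGTERM, _on_sigterm)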
Posted Nov 13, 2023 - 19:36 EST

Resolved
This incident has been resolved. Scheduled jobs are back in a healthy state.
Posted Nov 08, 2023 - 08:13 EST
Monitoring
We have fixed the issue where a small percentage of runs were failing with the error “This run timed out after 30 minutes of inactivity”. We are monitoring the cluster to make sure we are in a good state. Please reach out to support@getdbt.com if you see any failures with this error.
Posted Nov 08, 2023 - 07:15 EST
Identified
We have identified an issue where a small percentage of runs are failing with the error “This run timed out after 30 minutes of inactivity. Please retry the run, or contact support if this job continues to fail”.
We are currently working on fixing this issue.
Posted Nov 08, 2023 - 05:22 EST
Investigating
We have received multiple reports of job runs timing out. We are currently investigating the cause and hope to have an update soon.
Posted Nov 08, 2023 - 04:13 EST
This incident affected: North America (N. Virginia) (Scheduled Jobs), Europe (Frankfurt) (Scheduled Jobs), and Australia (Sydney) (Scheduled Jobs).