Job runs are currently being queued up indefinitely in dbt Cloud
Incident Report for dbt Cloud
Postmortem

Summary

An issue occurred that impacted scheduled runs of jobs between 5:30 PM and 10:30 PM Eastern (8/28 21:30 - 8/29 2:30 UTC) on the US Multi-tenant instance of dbt Cloud. The AU and EMEA Multi-tenant instances were affected between 8/28 21:00 and 8/29 00:30 UTC.

  • During this time, runs became stuck in a queued state and were canceled after 30 minutes.
  • From 8/29 00:00 - 2:30 UTC, the US Multi-tenant Kubernetes control plane was overwhelmed and unable to roll out our fix.

Impact

During the outage, scheduled runs were stuck in a queued state and canceled automatically after 30 minutes in this state.

We apologize for this outage, and to every customer who was affected. We take our responsibility very seriously and are making multiple improvements to our systems and processes to prevent this kind of outage and to improve our recovery time.

Root Cause

Code change

A change was made to a fundamental part of our execution code, which caused runs to fail in an unexpected way and also prevented our fallbacks from picking up these runs. We execute an internal tool in a subprocess to collect run step logs, and that subprocess accepts a dictionary of environment variables. Because Python does not enforce types on this dictionary, an integer value was assigned to one of its entries, which subprocess.Popen does not allow. This type of error would typically be caught by our manual testing and end-to-end test suite; however, due to configuration differences between our development and testing environments, it was missed. We have already remedied these environment differences.
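
For illustration, here is a minimal sketch of this class of error (not our actual code; the environment variable name is hypothetical). subprocess.Popen requires every key and value in the env mapping to be a string, and it raises the error before the child process ever starts, which is why the log-collection step never ran:

    import os
    import subprocess

    # Minimal reproduction of the class of error described above; the variable
    # name is hypothetical. Popen requires every env key and value to be a string.
    env = dict(os.environ)
    env["DBT_STEP_INDEX"] = 1  # an int sneaks in; nothing flags it until spawn time

    try:
        subprocess.Popen(["echo", "collecting run step logs"], env=env)
    except TypeError as exc:
        # Raised before the subprocess starts (exact message varies by platform),
        # so the log-collection step never runs at all.
        print(f"run step failed to start: {exc}")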

Mitigation

Upon detecting the regression, we immediately moved to revert the change; however, we hit several other issues that delayed remediation. While we were attempting to deploy the fix, GitHub was having an incident of its own: our first attempt failed because GitHub did not register checks on our gitops repo for over 10 minutes, and our second attempt failed because GitHub Actions was stuck waiting for a runner. Once our checks ran, we were able to deploy our fix.

Once our environments picked up the new code, our AU and EMEA instances rolled out the new version; however, our US Multi-tenant instance encountered an overwhelmed Kubernetes control plane and could not roll out the new version. We were eventually able to scale down the bad deployment and scale up the new version, which allowed runs to resume. At this point, there was a large backlog of queued runs, so we canceled runs more than 30 minutes old, allowing the platform to resume normal operations.

Next Steps or Lessons Learned

Remediation steps we have already taken

  • We have locked down core parts of our execution with CODEOWNERS to ensure the right people review any changes to this code
  • We have updated development and testing environments to execute in the same way as production
  • We have added additional steps to our user acceptance testing process that ensure core workflows work in all environments

Planned Remediation

Immediate monitoring and mitigation - ETA: End of Sep’23

  • Improve metrics and alerting for runs failing to start
  • Expand our usage of type checking in Python (mypy); see the sketch after this list
  • Add additional unit and end-to-end testing around the specific issue at the root of this incident
  • Improve our exception handling around this fundamental part of our execution environment
  • Improve our code rollout and rollback processes to make them more expedient and less susceptible to Kubernetes control plane congestion
  • Support automatic rollbacks of deployments experiencing higher than normal error volumes
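
As a sketch of the mypy item above (hypothetical function and variable names, not our actual code), annotating the environment mapping as Dict[str, str] is enough for mypy to flag the kind of integer assignment that caused this incident before it reaches production:

    import os
    import subprocess
    from typing import Dict

    def collect_step_logs(extra_env: Dict[str, str]) -> None:
        """Spawn a (hypothetical) log-collection subprocess with extra env vars."""
        subprocess.Popen(["echo", "collecting run step logs"], env=extra_env).wait()

    env: Dict[str, str] = {**os.environ, "DBT_RUN_ID": "12345"}
    # env["DBT_STEP_INDEX"] = 3     # mypy: Incompatible types in assignment
    #                               # (expression has type "int", target has type "str")
    env["DBT_STEP_INDEX"] = str(3)  # the annotation forces the conversion to be explicit
    collect_step_logs(env)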
Posted Sep 06, 2023 - 10:50 EDT

Resolved
The issue has been resolved, and all affected systems are now functioning normally.

Please contact Support via chat or email support@getdbt.com if you continue to experience delays and are unsure of the root cause.

We understand how critical dbt Cloud is to your ability to get work done day-to-day and your experience matters to us. We’re grateful to you for your patience during this incident.
Posted Aug 29, 2023 - 00:30 EDT
Monitoring
Fixes have now rolled out across dbt Cloud US. To get the backlog of queued runs back into a healthy state, we will be manually cancelling some queued runs. These cancelled runs will show the message: "This run was interrupted by dbt Cloud because of an underlying infrastructure issue. Please retry the run, or contact support if runs continue to cancel."

Any newly triggered run from here on should run as expected.
Posted Aug 28, 2023 - 23:39 EDT
Update
Fixes are slowly rolling out across dbt Cloud US, and some runs are now able to kick off as expected. We will post another update when the full rollout is complete.
Posted Aug 28, 2023 - 22:10 EDT
Update
The fix for queued runs has been deployed in dbt Cloud AU and EMEA. Previously queued runs may have been cancelled due to inactivity, but future runs should run as scheduled. Manually triggered runs should also run as expected.

Fixes are still rolling out for dbt Cloud US.
Posted Aug 28, 2023 - 20:47 EDT
Identified
Engineering has identified a faulty commit and is rolling back changes to address the indefinite queue times with scheduled jobs.
Posted Aug 28, 2023 - 19:01 EDT
Investigating
Job runs are currently being queued up indefinitely in dbt Cloud. We're looking into this internally and will provide an update ASAP.
Posted Aug 28, 2023 - 18:25 EDT
This incident affected: North America (N. Virginia) (Scheduled Jobs), Europe (Frankfurt) (Scheduled Jobs), and Australia (Sydney) (Scheduled Jobs).