dbt Cloud deploy jobs failing
Incident Report for dbt Cloud
Postmortem

Summary

On September 9, 2024, at 8:11 PM (UTC), a change to the job orchestration service was deployed to production, causing failures in job runs that did not use private keys for cloning Git repositories.

Our monitoring system quickly identified a surge in errors, and the issue was mitigated by reverting the change, which restored system functionality.

Impact

  • Customers on our multi-tenant or multi-cell environments that did not use SSH keys for Git repository cloning on their job runs would have observed their jobs failing during the impact period.
  • The failed job runs could be successfully re-run after the issue was mitigated.

Root Cause

A code change in the job orchestration service introduced a regression that led to exceptions when processing job runs that do not use private keys for cloning Git repositories.

This regression was not caught before the deployment due to missing test coverage for this specific code path.
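
The report does not include the code involved. Purely as an illustration of this class of bug, the sketch below assumes a hypothetical `JobRun` model with an optional `private_key` field and a hypothetical helper `normalized_private_key`; neither name comes from dbt Cloud. It shows the kind of paired tests that exercise both the "key present" and "no key" paths:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRun:
    # Hypothetical model, not the actual dbt Cloud data structure.
    repo_url: str
    private_key: Optional[str] = None  # None when no SSH key is configured


def normalized_private_key(run: JobRun) -> Optional[str]:
    """Return the key with surrounding whitespace removed, or None if unset."""
    # Calling run.private_key.strip() unconditionally would raise
    # AttributeError for runs without a key, so guard the None case first.
    if run.private_key is None:
        return None
    return run.private_key.strip()


def test_run_without_private_key() -> None:
    run = JobRun(repo_url="https://example.com/org/repo.git")
    assert normalized_private_key(run) is None


def test_run_with_private_key() -> None:
    run = JobRun(repo_url="git@example.com:org/repo.git", private_key="  KEY\n")
    assert normalized_private_key(run) == "KEY"
```

A test for the "no key" path is the one that would fail if the guard were removed.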

The issue had a broad impact because the service was deployed using a new pipeline that was not configured for a gradual rollout.

Mitigation and Prevention

  • Mitigation: Our monitoring system detected the error surge immediately following the deployment. The team reverted the change, restoring normal operations.
  • Test Coverage Improvement: We have now implemented the necessary tests to cover the code path that caused the regression.
  • Deployment Improvement: We are enhancing the deployment pipeline for this service so that future updates follow a more gradual rollout process, reducing the risk of widespread impact.
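
As a rough sketch of what a gradual rollout can look like, the example below deploys to one stage of targets at a time and stops as soon as a stage shows an elevated error rate. The function name `staged_rollout`, the `deploy` and `error_rate` callbacks, and the 1% threshold are all assumptions made for illustration; they do not describe the actual dbt Cloud pipeline.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    error_rate: Callable[[str], float],
    max_error_rate: float = 0.01,  # assumed threshold, for illustration only
) -> List[str]:
    """Deploy stage by stage and return the targets that received the change.

    The rollout halts after the first stage in which any target exceeds
    max_error_rate, so later stages never receive the faulty change.
    """
    deployed: List[str] = []
    for stage in stages:
        for target in stage:
            deploy(target)
            deployed.append(target)
        # Check health of the stage just deployed before moving on.
        if any(error_rate(target) > max_error_rate for target in stage):
            break
    return deployed
```

With stages such as `[["cell-1"], ["cell-2", "cell-3"]]`, a regression that raises the error rate on `cell-1` would stop the rollout before the remaining targets are touched.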

We sincerely apologize for this outage to every customer who was impacted. We understand that you rely on the dbt Cloud application, especially job execution, as a key tool. We are confident that these measures will effectively prevent similar incidents in the future and ensure a more stable experience for all our users.

Posted Sep 12, 2024 - 21:04 EDT

Resolved
A code change to the run execution code caused errors during dbt deps, which in turn caused job runs to fail. Our internal monitoring detected the issue, and reverting the code change returned the system to a healthy state.
We sincerely apologize for this incident to all customers impacted. The team will work to identify action items to prevent similar issues in the future.
Posted Sep 09, 2024 - 17:44 EDT
Update
A fix has been rolled out and we are seeing a decrease in errors and are monitoring. Customers should retry failed runs.
Posted Sep 09, 2024 - 17:17 EDT
Update
A fix is currently being rolled out to impacted instances. We will provide another update in 30 minutes. Please contact dbt Labs Support team via the UI chatbot or support@getdbt.com if you have any questions.
Posted Sep 09, 2024 - 16:59 EDT
Update
We are continuing to work on a fix for this issue.
Posted Sep 09, 2024 - 16:48 EDT
Identified
We have identified increased errors that are causing runs to be interrupted and jobs to fail within dbt Cloud. The root cause has been identified and we are working on rolling out a fix.
Posted Sep 09, 2024 - 16:35 EDT
This incident affected: North America (N. Virginia) (Scheduled Jobs), Europe (Frankfurt) (Scheduled Jobs), Australia (Sydney) (Scheduled Jobs), North America (AWS) Cell 1 (Scheduled Jobs), North America (AWS) Cell 2 (Scheduled Jobs), and Europe (Azure) Cell 1 (Scheduled Jobs).