dbt deps failing in job runs
Incident Report for dbt Cloud
Postmortem

Summary

On Tuesday, Nov 7th, between 12:01 PM UTC and 1:33 PM UTC, runs triggered via schedule, UI, and API were affected by a spike in errors from the dbt deps command, which is one of the first steps every dbt Cloud run executes as a prerequisite.

Impact

  • During the outage, 40% of dbt runs failed at the dbt deps step.
  • All runs that could not execute dbt deps errored.
  • The retry logic in dbt deps could not complete the command, and we saw a spike in errors showing SSL connection failures to hub.getdbt.com.

We apologize for this outage to every customer who was affected. We are making multiple improvements to our systems so that this kind of outage does not cause run errors and so that users have alternatives for withstanding third-party service unavailability.

Root Cause

The root cause was that our third-party hosting provider’s firewall systems flagged our traffic and started blocking it. The dbt deps command has built-in retries: when it runs into network connection issues, it can safely retry without aborting the command. The retry logic most likely exacerbated the issue.

Hosting provider change

To better serve our customers and provide a higher SLA, dbt Labs switched to Vercel as the hosting provider for [hub.getdbt.com](<http://hub.getdbt.com>) on October 2nd. All our tenants relied on the hub’s availability to run the dbt deps command successfully. Because this service is critical to our runs, we used Vercel’s edge caching and high-throughput static site service to return the metadata required to execute dbt deps.

Starting at 12:01 PM UTC on November 7th, we noticed a spike in SSLError exceptions, and dbt attempted to retry 5 times to recover from the network error. When dbt was unable to recover, the dbt deps command failed for about 40% of runs on the multi-tenant US deployment. The error was isolated to multi-tenant, and we were unable to reproduce it from other network locations. To recover, we redirected traffic to a different URI provided by Vercel and immediately escalated to their support team to understand the root cause. Upon further investigation, they revealed that their firewall systems had blocked traffic from an IP address of the dbt Cloud multi-tenant US deployment.
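
For illustration only, a bounded retry loop of the kind described above can be sketched as follows. This is not dbt's actual retry code: the function name, the parameters, the reading of "5 times" as five attempts in total, and the exponential backoff with jitter are all assumptions. Backoff with jitter is one common way to keep retries from looking like a traffic burst from a single IP address; whether dbt's retries are spaced this way is not stated in this report.

```python
import random
import time


def retry_with_backoff(fetch, attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fetch() up to `attempts` times, pausing between failed tries.

    Hypothetical helper for illustration; not part of dbt or dbt Cloud.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError as exc:
            # ssl.SSLError and ConnectionError are both OSError subclasses.
            last_error = exc
            if attempt < attempts - 1:
                # Full jitter: sleep a random time below an exponentially growing cap.
                cap = min(max_delay, base_delay * 2 ** attempt)
                sleep(random.uniform(0, cap))
    # Every attempt failed: surface the last network error to the caller.
    raise last_error
```

With these defaults, a call that never succeeds makes five attempts and pauses four times before raising, so a firewall block like the one in this incident would still end in a failed command.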

Next Steps or Lessons Learned

Remediation steps we have already taken

  • We unblocked the IP addresses of the multi-tenant US deployment in Vercel’s firewall system.
  • We are working with Vercel through enterprise-grade support to unblock all our IPs and to follow best practices for how asynchronous workers interact with their services.
  • We have hosted a mirror of hub.getdbt.com on S3, and we will be transitioning to S3 as our primary provider next week.

Planned Remediation

Immediate mitigation and long-term solution - ETA: End of Dec’23

  • We are planning to add automated fallback between hosting providers. When one hosting provider is unreachable, we will automatically retry against the other hosting providers.
  • We are currently working on repository caching, an opt-in feature that lets users re-use a cached version of their repository instead of attempting a fresh clone or a fresh package install.
  • When git clone or dbt deps fails, repository caching will restore the repository and packages from the last successful run in the environment.
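
The planned fallback and caching behavior can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dbt Cloud implementation: the function name, the `fetch` callable, the host list, and the dictionary used as a cache are all hypothetical stand-ins.

```python
def fetch_with_fallback(path, hosts, fetch, cache=None):
    """Try each host in order; if all fail, return the last cached result.

    `hosts`, `fetch`, and `cache` are hypothetical stand-ins, not dbt Cloud APIs.
    """
    errors = []
    for host in hosts:
        try:
            result = fetch(host, path)
        except OSError as exc:
            # Network or SSL failure: record it and move on to the next provider.
            errors.append(f"{host}: {exc}")
            continue
        if cache is not None:
            cache[path] = result  # remember the last successful response
        return result
    # Every provider failed: restore from the last successful fetch, if any.
    if cache is not None and path in cache:
        return cache[path]
    raise RuntimeError("all hosts failed: " + "; ".join(errors))
```

A caller would pass the providers in priority order, for example the primary hub host followed by a mirror, so that a block at one provider does not fail the run as long as another provider or a cached copy is available.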
Posted Nov 13, 2023 - 19:32 EST

Resolved
This issue has now been resolved. We are no longer seeing dbt deps fail in job runs with an SSL error.
Posted Nov 07, 2023 - 10:34 EST
Monitoring
We have identified the cause of the issue. There was a problem with the SSL authentication to the website provider where we host hub.getdbt.com. We have made a configuration change to fix the problem and are currently monitoring the status.
Posted Nov 07, 2023 - 09:08 EST
Investigating
We have had multiple reports of dbt deps failing in job runs. We are currently investigating the cause and hope to have an update soon.
Posted Nov 07, 2023 - 07:49 EST
This incident affected: North America (N. Virginia) (Scheduled Jobs), Europe (Frankfurt) (Scheduled Jobs), and Australia (Sydney) (Scheduled Jobs).