On Tuesday, Nov 7th, between 12:01 PM UTC and 1:33 PM UTC, an issue affected runs triggered via schedule, UI, and API. We noticed a spike in errors from the `dbt deps` command, which is one of the first steps every dbt Cloud run executes as a prerequisite.
Runs failed when the `dbt deps` step errored while contacting hub.getdbt.com.
We apologize for this outage to every customer who was affected. We are making multiple improvements to our systems so that this kind of outage does not cause run errors and so that users have alternatives that let them withstand third-party service unavailability.
The root cause was that our third-party provider’s firewall systems flagged our traffic and started blocking it. dbt has built-in retries for the `deps` command: when it runs into network connection issues, it can safely retry without aborting the command. The retry logic most likely exacerbated the issue.
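The retry behavior described above can be sketched roughly as follows. This is a minimal illustration and not dbt's actual implementation: the function name `retry_on_network_error`, its parameters, the default of 5 attempts, and the exponential backoff with jitter are all assumptions made for the example. Backoff with jitter is shown because it is a common way to keep many clients from retrying in lockstep against a service that is already rejecting traffic.

```python
import random
import time


def retry_on_network_error(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on network-level errors (hypothetical helper).

    ssl.SSLError and the built-in ConnectionError are both subclasses of
    OSError, so catching OSError covers either kind of failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except OSError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff with full jitter, so that many runs failing
            # at the same moment do not all retry at the same moment.
            delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
            sleep(delay)
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting; a caller would normally leave it at the default.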
Hosting provider change
To better serve our customers and provide a higher SLA, dbt Labs switched to Vercel as the hosting provider for [hub.getdbt.com](<http://hub.getdbt.com>) on October 2nd. All our tenants rely on the hub’s availability to run the `dbt deps` command successfully. Because this service is critical to our runs, we used Vercel’s edge caching and high-throughput static site service to return the metadata required to execute `dbt deps`.
Starting at 12:01 PM UTC on November 7th, we noticed a spike in `SSLError` exceptions, and dbt attempted to retry 5 times to recover from the network error. When dbt was unable to recover, the `dbt deps` command failed for about 40% of runs on the multi-tenant US deployment. The error was isolated to multi-tenant, and we were unable to reproduce it from other network locations. To recover, we redirected traffic to a different URI provided by Vercel and immediately escalated to their support team to understand the root cause. Upon further investigation, they found that their firewall systems had blocked traffic from an IP address belonging to the dbt Cloud multi-tenant US deployment.
Immediate mitigation and long-term solution - ETA: End of Dec’23
When `git clone` or `dbt deps` fails, repository caching will restore the repository and packages from the last successful run in the environment.
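The fallback idea behind repository caching can be sketched as below. This is an illustrative outline only, not how dbt Cloud implements the feature: the function `fetch_with_cache_fallback`, the `fetch` callable, and the directory layout are hypothetical names chosen for the example.

```python
import shutil
from pathlib import Path


def fetch_with_cache_fallback(fetch, workdir: Path, cache_dir: Path) -> str:
    """Run fetch(workdir); on failure, restore workdir from cache_dir.

    Returns "fresh" if the fetch succeeded and "cached" if the cache was used.
    `fetch` stands in for a step such as cloning a repo or installing packages.
    """
    try:
        fetch(workdir)
    except Exception:
        if not cache_dir.exists():
            raise  # nothing cached yet, so the failure must surface
        # Discard any partial download, then restore the last good copy.
        shutil.rmtree(workdir, ignore_errors=True)
        shutil.copytree(cache_dir, workdir)
        return "cached"
    # Success: refresh the cache so it reflects the last successful run.
    shutil.rmtree(cache_dir, ignore_errors=True)
    shutil.copytree(workdir, cache_dir)
    return "fresh"
```

In this sketch the cache is only overwritten after a successful fetch, so a failed run can never replace a good copy with a partial one.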