On Wednesday, April 2, 2025, starting at 12:44 UTC, we experienced an outage affecting dbt Labs’ AWS environments. While initially limited to a small subset of environments, at 15:00 UTC the issue expanded to impact all dbt Labs AWS regions. Customers in non-AWS regions were unaffected by this incident. An automated operating system update prevented workloads created in our production environments after 12:44 UTC from establishing network connectivity. This severely impacted dbt Cloud services, including the Cloud IDE, Cloud CLI, Visual Editor, and scheduled jobs. After identifying and resolving the underlying issue, we fully restored services across all affected environments by 23:25 UTC through corrective node replacements and configuration rollbacks.
During the incident, all newly created workloads failed to obtain network connectivity, causing widespread performance degradation and outages across multiple dbt Cloud products:
Cloud IDE: Users were unable to launch new IDE sessions, and some users with an already-active IDE session were unable to successfully complete invocations.
Scheduled Jobs: Job queuing and execution were significantly impacted, resulting in job delays and cancellations.
Cloud CLI: Invocations failed, leaving users unable to use the CLI.
Visual Editor: Users were unable to run invocations.
dbt Labs monitoring first detected IDE failures in a subset of environments starting at 12:59 UTC, but the failure volume was below alert thresholds, which delayed investigation. The first customer reports of service failures were received at 14:04 UTC. dbt Support immediately began investigating and declared a high-severity incident by 14:51 UTC.
Recognizing the immediate impact, our internal teams escalated the issue to the highest severity and, based on the symptoms, engaged AWS support for enhanced troubleshooting. Our investigation determined that the outage was not an AWS service issue; it stemmed from an internal automated process that had deployed an operating system update, which introduced an unforeseen network configuration issue.
Once dbt Labs engineers successfully replicated the issue, the team developed, tested and deployed a fix to all affected infrastructure components, achieving full service restoration by 23:25 UTC.
The outage was caused by a faulty operating system update that was applied to nodes within our AWS environments by an automatic background process, bypassing our normal staged release process.
That update included a bug (https://bugs.launchpad.net/cloud-images/+bug/2106107) that overwrote a critical network configuration responsible for assigning network connections. With this configuration changed, newly created workloads were unable to establish network connectivity, rendering them inoperable and causing widespread service degradation and failures across major dbt Cloud services.
❓Why were only a small number of environments initially exhibiting this behavior?
Individual nodes in our production environments were configured to install system updates as they booted. As a result, newly created nodes inherited the problematic update as soon as it was published, while existing nodes continued to operate normally.
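To make the failure mode concrete, here is a minimal, hypothetical sketch of such a boot-time update hook, assuming an apt-based image (this is illustrative, not our actual provisioning code). Because the hook installs whatever package versions are newest at boot, any node created after a faulty version is published inherits it immediately:

```python
#!/usr/bin/env python3
"""Hypothetical boot-time update hook (illustrative sketch only)."""
import subprocess

def apply_updates_at_boot() -> None:
    # Refresh package indexes, then upgrade to the newest published
    # versions. Nothing here pins versions or waits for a staged
    # rollout, so a freshly published (and in this case faulty)
    # package reaches every newly created node immediately, while
    # long-running nodes keep the versions they booted with.
    subprocess.run(["apt-get", "update"], check=True)
    subprocess.run(["apt-get", "--yes", "upgrade"], check=True)

if __name__ == "__main__":
    apply_updates_at_boot()
```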
❓Why were all regions impacted starting at 15:00 UTC?
In addition to applying updates at boot, nodes were configured to install updates at the same scheduled time, regardless of region. So although dbt Cloud regions are completely decoupled from one another, every node applied the same problematic update simultaneously at 15:00 UTC.
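The coupling, and one conventional mitigation, can be sketched as follows (the region names and the staggering scheme are hypothetical illustrations, not our actual scheduler):

```python
"""Illustrative sketch: a shared update time couples otherwise decoupled regions."""
import hashlib
from datetime import time

# Failure mode: every node in every region upgrades at the same
# wall-clock time, so one bad package lands everywhere at once.
SHARED_UPDATE_TIME = time(hour=15, minute=0)  # 15:00 UTC in all regions

def staggered_update_time(region: str, window_minutes: int = 24 * 60) -> time:
    """Mitigation sketch: hash the region name into a stable offset so
    each region upgrades in its own window and a bad update cannot hit
    all regions simultaneously."""
    offset = int(hashlib.sha256(region.encode()).hexdigest(), 16) % window_minutes
    return time(hour=offset // 60, minute=offset % 60)

if __name__ == "__main__":
    for region in ("aws-us-east", "aws-eu-west", "aws-ap-south"):  # hypothetical names
        print(region, staggered_update_time(region))
```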
❓Why did this issue only impact AWS?
In AWS environments, we utilize base operating system images that are tailored to our AWS infrastructure. The bug applied only to a particular combination of the operating system package and the AWS-specific image. dbt Cloud environments on other cloud providers utilize different images and operating systems and were not affected.
As standard practice, we upgrade and replace existing infrastructure with new versions that have been tested prior to release, and we roll these changes out incrementally, environment by environment. For operating system updates and security patches, however, we bypassed this standard release process by applying the changes directly to our production environments, resulting in this incident.
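For contrast, the staged process described above can be sketched roughly as follows; the environment names, bake time, and deploy/health-check helpers are hypothetical placeholders, not our tooling. The key property is that a faulty change halts at the first unhealthy stage instead of reaching every environment:

```python
"""Illustrative sketch of an environment-by-environment rollout."""
import time

# Hypothetical rollout order: each stage must bake and stay healthy
# before the change is allowed to proceed to the next one.
STAGES = ["internal", "canary", "production-1", "production-2"]
BAKE_SECONDS = 5  # shortened for the example; real bake times are far longer

def deploy(env: str) -> None:
    print(f"deploying update to {env}")  # stand-in for real deployment

def healthy(env: str) -> bool:
    return True  # stand-in for real health checks (connectivity, jobs, etc.)

def staged_rollout() -> None:
    for env in STAGES:
        deploy(env)
        time.sleep(BAKE_SECONDS)  # let the change bake before advancing
        if not healthy(env):
            print(f"halting rollout: {env} is unhealthy")
            return  # the faulty change never reaches later environments
    print("rollout complete")

if __name__ == "__main__":
    staged_rollout()
```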
To prevent this behavior in the future, we have disabled all automated upgrade mechanisms from applying changes directly to our production environments. We are also actively auditing all infrastructure, software, and automation tooling to identify and eliminate any remaining deployment paths that could bypass our release process. Moving forward, all production changes, regardless of type, will adhere to our release and rollout process.
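As one example of what such an audit can check on Ubuntu-based images, the sketch below reads the standard apt configuration file that controls unattended upgrades and flags any image where they remain enabled (the file path and keys follow Ubuntu conventions; the script itself is an illustration, not our audit tooling):

```python
"""Illustrative audit: verify unattended upgrades are disabled on an image."""
import re
import sys
from pathlib import Path

# On Ubuntu, /etc/apt/apt.conf.d/20auto-upgrades typically contains:
#   APT::Periodic::Update-Package-Lists "1";
#   APT::Periodic::Unattended-Upgrade "1";
# A value of "1" for Unattended-Upgrade means the background upgrader is active.
CONF = Path("/etc/apt/apt.conf.d/20auto-upgrades")

def unattended_upgrades_enabled() -> bool:
    if not CONF.exists():
        return False
    match = re.search(r'APT::Periodic::Unattended-Upgrade\s+"(\d+)"', CONF.read_text())
    return bool(match) and match.group(1) != "0"

if __name__ == "__main__":
    if unattended_upgrades_enabled():
        sys.exit("FAIL: unattended upgrades are still enabled on this image")
    print("OK: no automatic background upgrades on this image")
```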
We also recognize the importance of clear, timely, and informative communication with our customers during incidents. To better serve our customers in future events, we are reviewing and refining our incident communication processes to ensure faster notification, regular status updates, clearer messaging around incident impact, and detailed status information through established communication channels.
Finally, this incident highlighted gaps in our observability that increased the time required to diagnose the issue and return the service to normal operations. We are actively investing in improved monitoring, logging, and alerting tools, along with clearer diagnostic workflows, to significantly shorten incident detection and resolution times. These enhancements will ensure quicker identification of underlying causes, faster resolution, and reduced customer impact in future incidents.
We sincerely apologize for this outage and its impact on your teams. We hold ourselves accountable to extremely high standards of reliability, and we take our responsibility as stewards of your data and workflows very seriously.
Our teams have already implemented immediate corrective actions and are rapidly deploying additional safeguards to protect against similar events. We are committed to continuously improving our operational excellence, ensuring minimal disruption and quick recovery should future issues arise.
Thank you for your continued trust and partnership. If you have any questions or need further assistance, please contact us directly at support@getdbt.com.