On Monday, March 27th, at 19:06 Eastern time (23:06 UTC), we released a change that delayed the ingestion of metadata from completed runs in dbt Cloud. The change affected global multi-tenant environments, and all metadata events were delayed for the duration of the incident. As a result, customers could not query up-to-date information about their runs from the Metadata API or view the Model Timing visualization in the dbt Cloud UI. Runs continued to execute as expected; only queries related to run metadata were affected.
At 20:11 Eastern time (00:11 UTC the next day), our engineering team reverted the change. Because of the number of runs whose metadata ingestion had been delayed, the service had a backlog to work through. Until it caught up, users were still unable to reliably query up-to-date information about their runs from the Metadata API or view the Model Timing visualization in the dbt Cloud UI. Both capabilities were fully recovered by 20:35 Eastern time (00:35 UTC the next day).
We sincerely apologize for this outage to every customer who was impacted. We know that you rely on the dbt Cloud application and its APIs as a key tool. We have made multiple improvements to our systems and processes, and we plan further changes to prevent this type of outage in the future.
The root cause of the issue was a code change for a release where we intended to change the business logic for ingestion. To facilitate this change, we upgraded an internal package. This internal package includes an interface that consumes from a message queue for asynchronous processing.
Upgrading this internal package changed its interface. However, we did not make the changes required to use the new version of the package.
This caused our ingestion service to break completely.
While we unit-tested our code to validate the business logic change of our ingestion, we did not perform an end-to-end test to make sure that consumption from the message queue continued to work. As such, we failed to catch this error during development.
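To illustrate the gap between the two kinds of test, here is a minimal sketch in Python. It is not our actual ingestion code: the `UpgradedConsumer` class, the `(payload, context)` handler signature, and every function name are hypothetical, chosen only to show how a unit test of business logic can pass while an end-to-end check through the queue exposes an interface mismatch.

```python
import queue


class UpgradedConsumer:
    """Hypothetical stand-in for the upgraded internal package.

    Assumption: the new version calls handlers with (payload, context),
    where the previous version passed only the payload.
    """

    def __init__(self, source: queue.Queue):
        self.source = source

    def drain(self, handler) -> int:
        """Deliver every queued message to handler and return the count."""
        processed = 0
        while True:
            try:
                payload = self.source.get_nowait()
            except queue.Empty:
                return processed
            handler(payload, {"attempt": 1})
            processed += 1


def summarize_run(payload: dict) -> dict:
    """Business logic: derive a metadata record from a completed run."""
    return {"run_id": payload["run_id"], "model_count": len(payload["models"])}


def make_legacy_handler(store: list):
    # Written against the old interface: accepts the payload only.
    def handle(payload):
        store.append(summarize_run(payload))
    return handle


def make_updated_handler(store: list):
    # Written against the new interface: accepts payload and context.
    def handle(payload, context):
        store.append(summarize_run(payload))
    return handle


def ingest_end_to_end(make_handler) -> list:
    """Push one run through the queue and return what was ingested."""
    source = queue.Queue()
    source.put({"run_id": 1, "models": ["orders", "customers"]})
    store = []
    UpgradedConsumer(source).drain(make_handler(store))
    return store


if __name__ == "__main__":
    # The unit test of the business logic passes regardless of the interface.
    assert summarize_run({"run_id": 1, "models": ["orders"]})["model_count"] == 1

    # The end-to-end check passes only for the handler written for the new interface.
    assert ingest_end_to_end(make_updated_handler) == [{"run_id": 1, "model_count": 2}]
    try:
        ingest_end_to_end(make_legacy_handler)
    except TypeError as exc:
        print("end-to-end check caught the mismatch:", exc)
```

In this sketch, `summarize_run` is correct in both cases, so a test that exercises only that function reports success. The mismatch surfaces only when a message travels through the consumer to the handler, which is the path an end-to-end test covers.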
We understand how critical dbt Cloud and its APIs are to your ability to get work done day-to-day. Furthermore, your experience with dbt Cloud matters to us. We’re grateful to you for your patience during this incident.