dbt Cloud Metadata API returning stale data
Incident Report for dbt Cloud
Postmortem

Summary

On Monday, March 27th, at 19:06 Eastern time (23:06 UTC), we released a change that delayed the ingestion of metadata from completed runs in dbt Cloud. This impacted global multi-tenant environments, and all metadata events were impacted for the duration. Customers could not query up-to-date information about their runs from the Metadata API or view the Model Timing visualization in the dbt Cloud UI. Runs continued to be executed as expected; only queries related to the run metadata were affected.

At 20:11 Eastern time (00:11 UTC the next day), our engineering team reverted the change. Because metadata from a large number of runs was still waiting to be ingested, the service needed time to work through the backlog. Until it did, users were still unable to reliably query up-to-date information about their runs from the Metadata API or view the Model Timing visualization in the dbt Cloud UI. Both capabilities were fully recovered by 20:35 Eastern time (00:35 UTC the next day).

Impact

  • During the outage, customers were not able to query up-to-date information from the Metadata API or view the Model Timing visualization in dbt Cloud.
  • By 20:35 Eastern time (00:35 UTC the next day), ingestion was back to normal, all historical events had been replayed, and no more delays were observed.

We sincerely apologize for this outage to every customer that was impacted. We know that you rely on the dbt Cloud application and its APIs as a key tool. We have made multiple improvements to our systems and processes, and we plan further changes to prevent this type of outage in the future.

Root Cause

The root cause of the issue was a code change for a release where we intended to change the business logic for ingestion. To facilitate this change, we upgraded an internal package. This internal package includes an interface that consumes from a message queue for asynchronous processing.

Upgrading this internal package changed the interface. However, we did not update our ingestion code to match the interface of the new version.

This caused our ingestion service to break completely.

While we unit-tested our code to validate the business logic change of our ingestion, we did not perform an end-to-end test to make sure that consumption from the message queue continued to work. As such, we failed to catch this error during development.
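
The gap described above can be illustrated with a small sketch. Everything in it is hypothetical: this report does not describe the actual package or its interface, so `QueueConsumerV2`, `register`, `deliver`, and `transform_run_event` are invented names, and the specific change (a handler that gains a second argument) is only an assumed example of an interface change. It shows how a unit test of the business logic can pass while the wiring to an upgraded consumer interface is broken.

```python
# Hypothetical illustration only: not dbt Cloud's actual code or package.


def transform_run_event(event: dict) -> dict:
    """Business logic under test: normalise a completed-run event."""
    return {"run_id": event["run_id"], "status": event["status"].lower()}


class QueueConsumerV2:
    """Stand-in for an upgraded internal package whose interface changed.

    Assumption: the old version called handler(event), while the new
    version calls handler(event, context).
    """

    def __init__(self):
        self._handler = None

    def register(self, handler):
        self._handler = handler

    def deliver(self, event: dict):
        # New interface: handlers now receive a second argument.
        return self._handler(event, {"attempt": 1})


def unit_test_passes() -> bool:
    # The unit test calls the business logic directly, so it never
    # touches the consumer interface.
    result = transform_run_event({"run_id": 7, "status": "SUCCESS"})
    return result == {"run_id": 7, "status": "success"}


def wiring_fails() -> bool:
    # The handler was written for the old one-argument interface.
    consumer = QueueConsumerV2()
    consumer.register(transform_run_event)
    try:
        consumer.deliver({"run_id": 7, "status": "SUCCESS"})
    except TypeError:
        return True  # the message fails before the business logic runs
    return False
```

In this sketch both `unit_test_passes()` and `wiring_fails()` return `True`: the logic is correct, yet no message delivered through the consumer would be processed.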

Next Steps

  • We are introducing integration or end-to-end tests to make sure that for every release, the interface to consume from the message queue continues to work.
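
As a rough sketch of what such a test could look like, the example below publishes one message to a queue, drains it through the same entry point the consumer would call, and checks that the metadata lands in storage. All names (`InMemoryStore`, `handle_message`, `consume_all`) are assumptions made for illustration, and Python's standard-library `queue.Queue` stands in for the real message queue, which this report does not name.

```python
# Hypothetical sketch: names and structure are assumptions, not dbt Cloud code.
import queue


class InMemoryStore:
    """Stand-in for wherever ingested run metadata is persisted."""

    def __init__(self):
        self.rows = {}

    def save(self, run_id, metadata):
        self.rows[run_id] = metadata


def handle_message(message: dict, store: InMemoryStore) -> None:
    """Entry point the consumer invokes for each message."""
    store.save(message["run_id"], {"status": message["status"]})


def consume_all(q: queue.Queue, store: InMemoryStore) -> int:
    """Drain the queue through the consumer entry point; return the count."""
    processed = 0
    while True:
        try:
            message = q.get_nowait()
        except queue.Empty:
            return processed
        handle_message(message, store)
        processed += 1


def end_to_end_smoke_test() -> bool:
    """Publish one message and check that it is ingested and queryable."""
    q = queue.Queue()
    store = InMemoryStore()
    q.put({"run_id": 42, "status": "success"})
    processed = consume_all(q, store)
    return processed == 1 and store.rows.get(42) == {"status": "success"}
```

For the test to catch an interface change like the one in this incident, the in-memory queue would need to be replaced by a test instance of the real broker and the real consumer package, so that the upgraded interface itself is exercised on every release.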

We understand how critical dbt Cloud and its APIs are to your ability to get work done day-to-day. Furthermore, your experience with dbt Cloud matters to us. We’re grateful to you for your patience during this incident.

Posted Apr 04, 2023 - 17:18 EDT

Resolved
The issue has been resolved, and all affected systems are now functioning normally as of 2023-03-28 00:11 UTC. Between 2023-03-27 23:12 UTC and 2023-03-28 00:11 UTC, an issue with the Metadata Ingestion endpoint caused the Metadata API to return stale data. Metadata ingestion has now returned to normal.

Please contact Support via chat or email support@getdbt.com if you continue to experience delays or issues with the Metadata API.
Posted Mar 27, 2023 - 20:39 EDT
Monitoring
We have deployed a fix for the Metadata Ingestion layer that was causing the Metadata API to return stale data. We're continuing to monitor the situation.
Posted Mar 27, 2023 - 20:20 EDT
Investigating
We're investigating an issue with the dbt Cloud Metadata API that is causing stale data to be returned. This is impacting the Metadata API in dbt Cloud (All Regions) MT for runs that took place after 2023-03-27 23:12 UTC. The team is working on a resolution and we will provide updates at approximately 15-minute intervals or as soon as new information becomes available.
Posted Mar 27, 2023 - 20:04 EDT
This incident affected: Europe (Frankfurt) (Metadata API, Metadata Ingestion), North America (N. Virginia) (Metadata API, Metadata Ingestion), and Australia (Sydney) (Metadata API, Metadata Ingestion).