Customers hosted on the AU instance experiencing job queueing
Incident Report for dbt Cloud
Postmortem

Summary

On July 21st, multiple customers in the AU multi-tenant (MT) region experienced job runs that did not start and then timed out after 30 minutes. The issue was traced to two RabbitMQ exchanges in the AU region that were not routing messages correctly between our job scheduling services.

Since this incident, we have implemented several enhancements to our systems to prevent such outages, including additional alerting and monitoring to enable faster detection and resolution of similar issues in the future.

Impact

  • 47 customer accounts in the AU region were impacted during this outage.
  • In total, 818 out of 5090 job runs (approximately 16%) timed out across the impacted customers.
  • Impacted runs can be manually restarted. Scheduled runs will continue to trigger normally.
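
For readers who want to check the percentage, the figure follows directly from the two run counts above. The variable names below are illustrative only:

```python
# Run counts as stated in the Impact section.
timed_out_runs = 818
total_runs = 5090

# Share of job runs that timed out across the impacted customers.
share = timed_out_runs / total_runs
print(f"{share:.1%}")  # prints "16.1%"
```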

Root Cause

We currently use RabbitMQ as a message broker for job orchestration. During the outage, two of the RabbitMQ exchanges in the AU region failed to route messages properly between our job scheduling services because of unexpected behavior in a custom routing plugin. This plugin, designed to optimize message distribution, encountered an edge case that affected its core functionality. Consequently, a portion of the job runs were never queued for execution and were canceled by a cleanup service after 30 minutes.

Unfortunately, our system monitors did not detect that jobs were not being queued properly, so the issue was identified and resolved only after customers reported it.
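
To illustrate the failure mode described above, here is a minimal sketch of how a cleanup sweep could cancel runs that never reached a queue. It is not the actual dbt Cloud implementation: the `Run` type, its fields, and `cancel_unqueued_runs` are hypothetical names, and only the 30-minute timeout comes from this report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

START_TIMEOUT = timedelta(minutes=30)  # timeout stated in this report


@dataclass
class Run:
    """Hypothetical record of a job run, for illustration only."""
    run_id: int
    created_at: datetime
    queued_at: Optional[datetime] = None  # None: the message was never routed to a queue
    status: str = "starting"


def cancel_unqueued_runs(runs: List[Run], now: datetime) -> List[int]:
    """Cancel runs still unqueued after START_TIMEOUT and return their ids."""
    canceled = []
    for run in runs:
        waited = now - run.created_at
        if run.status == "starting" and run.queued_at is None and waited >= START_TIMEOUT:
            run.status = "canceled"
            canceled.append(run.run_id)
    return canceled
```

In this sketch a run whose message is dropped by the exchange keeps `queued_at` as `None`, so nothing distinguishes it from a run that is simply slow to start until the timeout expires.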

Mitigation and Prevention

  • Immediate Steps: We addressed the issue by restarting the backend queueing and job scheduling services, which successfully restored normal operations.
  • Long-term Solution: We are initiating a change to our backend scheduling architecture so that it is no longer reliant on a custom routing plugin. This change will align with the architecture already implemented in other areas of our platform, which has demonstrated superior stability.
  • Alerting and Monitoring: We have implemented a mechanism for alerting and monitoring to immediately detect similar conditions in the future. This approach will significantly improve our ability to identify and mitigate any potential issues before they impact services.
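
As a rough illustration of the kind of check the alerting item describes, the sketch below flags the condition when several runs have waited unqueued for longer than a short threshold. It is an assumption-laden example, not the monitoring we deployed: `ALERT_AFTER`, `MIN_STUCK_RUNS`, and `should_alert` are hypothetical, and the threshold values are placeholders.

```python
from datetime import datetime, timedelta
from typing import Iterable

ALERT_AFTER = timedelta(minutes=5)  # hypothetical threshold, well under the 30-minute timeout
MIN_STUCK_RUNS = 3                  # hypothetical floor so one slow start does not page anyone


def should_alert(unqueued_created_at: Iterable[datetime], now: datetime) -> bool:
    """Return True when enough runs have waited unqueued longer than ALERT_AFTER."""
    stuck = [created for created in unqueued_created_at if now - created > ALERT_AFTER]
    return len(stuck) >= MIN_STUCK_RUNS
```

A check like this would fire while affected runs are still recoverable, rather than after the cleanup service has already canceled them.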

We sincerely apologize to every customer impacted by this outage. We understand that you rely on the dbt Cloud application, especially job execution, as a key tool. We are confident that these measures will effectively prevent similar incidents in the future and ensure a more stable experience for all our users.

Posted Jul 26, 2024 - 22:19 EDT

Resolved
The issue has been resolved, and all affected systems are now functioning normally.
Please contact Support via email at support@getdbt.com if you continue to experience any issues and are unsure of the root cause.
We understand how critical dbt Cloud is to your ability to get work done day-to-day and your experience matters to us. We’re grateful to you for your patience during this incident.
Posted Jul 22, 2024 - 00:03 EDT
Update
We believe this issue is now resolved. Please note that prior runs may still be stuck and may need to be re-triggered. Please contact us at support@getdbt.com if you require any assistance or have any questions or concerns.
Posted Jul 21, 2024 - 23:17 EDT
Monitoring
We have deployed a fix for the reported issue that was causing jobs to queue and/or time out. Please reach out to support@getdbt.com if you're experiencing any issues or have any questions or concerns.
Posted Jul 21, 2024 - 22:59 EDT
Update
We are continuing to investigate this issue. Please reach out to us at support@getdbt.com with any questions or concerns.
Posted Jul 21, 2024 - 21:51 EDT
Investigating
We're investigating an issue with jobs on dbt Cloud accounts hosted on the AU instance being stuck in a "starting" stage. The team is working on a resolution, and we will provide updates at approximately 30-minute intervals or as soon as new information becomes available.
Posted Jul 21, 2024 - 20:50 EDT
This incident affected: Australia (Sydney) (Scheduled Jobs).