Investigating dbt Cloud deployment runs failing due to PrivateLink errors
Incident Report for dbt Cloud
Postmortem

Summary

On November 22, 2024, from 4:13 AM to 6:48 AM UTC, job runs relying on PrivateLink connections failed across multiple tenants. This was caused by a code change that prevented the job execution service from properly using PrivateLink for warehouse connections. The issue was detected via a monitoring alert, and a rollback resolved the problem.

Impact

  • During the impact window (4:13 AM to 6:48 AM UTC), all jobs relying on PrivateLink connections failed to execute for dbt Cloud accounts in AWS multi-tenant and multi-cell environments.
  • While this accounted for approximately 1% of all job runs across our entire customer base, we recognize that customers who depend heavily on PrivateLink may have been significantly impacted.
  • Impacted job runs failed with errors connecting to data warehouses and could be successfully re-run after the rollback.

Root Cause

A code change to environment variables caused an incorrect PrivateLink configuration to be passed to the job execution service. As a result, job runs could not connect to PrivateLink endpoints and ultimately failed with data warehouse connection errors.
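As an illustration only, the failure mode described above (a bad environment variable surfacing later as a warehouse connection error) is the kind of problem a fail-fast configuration check at service startup can catch. The sketch below is not dbt Cloud's implementation: the variable names `WAREHOUSE_PRIVATELINK_ENABLED` and `WAREHOUSE_PRIVATELINK_ENDPOINT`, the `PrivateLinkConfig` type, and the `load_privatelink_config` function are all hypothetical, since this report does not name the actual settings involved.

```python
from dataclasses import dataclass
from typing import Mapping, Optional
from urllib.parse import urlparse

# Hypothetical variable names; the real configuration keys are not given in this report.
ENABLED_VAR = "WAREHOUSE_PRIVATELINK_ENABLED"
ENDPOINT_VAR = "WAREHOUSE_PRIVATELINK_ENDPOINT"


class PrivateLinkConfigError(ValueError):
    """Raised when the PrivateLink settings are missing or malformed."""


@dataclass(frozen=True)
class PrivateLinkConfig:
    enabled: bool
    endpoint_host: Optional[str]


def load_privatelink_config(env: Mapping[str, str]) -> PrivateLinkConfig:
    """Validate PrivateLink settings up front instead of failing at connect time."""
    raw_enabled = env.get(ENABLED_VAR, "").strip().lower()
    if raw_enabled not in {"true", "false"}:
        raise PrivateLinkConfigError(
            f"{ENABLED_VAR} must be 'true' or 'false', got {raw_enabled!r}"
        )
    if raw_enabled == "false":
        return PrivateLinkConfig(enabled=False, endpoint_host=None)

    raw_endpoint = env.get(ENDPOINT_VAR, "").strip()
    # Accept either "host:port" or a full URL; prefix "//" so urlparse sees a netloc.
    candidate = raw_endpoint if "://" in raw_endpoint else f"//{raw_endpoint}"
    host = urlparse(candidate).hostname
    if not host:
        raise PrivateLinkConfigError(
            f"{ENDPOINT_VAR} is required when {ENABLED_VAR} is 'true'"
        )
    return PrivateLinkConfig(enabled=True, endpoint_host=host)
```

Under these assumptions, a service would call `load_privatelink_config(os.environ)` once at startup and refuse to start on `PrivateLinkConfigError`, so a misconfigured release fails during deployment rather than during customer job runs.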

Detection and Mitigation

The issue was detected by our monitoring system. Once the faulty code change was identified, the issue was resolved via a rollback.

Total resolution time was longer than usual due to the incident occurring during the holiday deployment freeze, which required additional processes to lift the freeze and initiate the rollback.

Prevention

  • Completed:

    • The faulty code change was reverted following the release rollback.
    • Validation mechanisms were implemented to prevent similar misconfigurations in the future.
  • Longer-term (by end of Jan 2025):

    • Implement additional tests for PrivateLink scenarios to catch similar issues before they reach production.
    • We are in the process of making larger architectural improvements, which include decoupling job scheduling functionality from other dbt Cloud services. This change will result in cleaner configuration management and will help prevent this kind of issue in the future.

We sincerely apologise for this outage to every customer who was impacted. We understand that you rely on the dbt Cloud application, especially job execution, as a key tool. We are confident that these measures will effectively prevent similar incidents in the future and ensure a more stable experience for all our users.

Posted Dec 02, 2024 - 16:29 EST

Resolved
The issue has been resolved, and all affected systems are now functioning normally.

Please contact Support via email support@getdbt.com if you continue to experience delays and are unsure of the root cause.

We understand how critical dbt Cloud is to your ability to get work done day-to-day and your experience matters to us. We’re grateful to you for your patience during this incident.
Posted Nov 22, 2024 - 01:57 EST
Monitoring
Confirming that all instances are now operational. We are continuing to work on returning US Cell1 to an operational state as well. Thank you for your patience.
Posted Nov 22, 2024 - 01:22 EST
Update
We are continuing to work on a fix for this issue.
Posted Nov 22, 2024 - 01:17 EST
Identified
We have identified the underlying issue and a fix is being implemented. We will provide an update shortly. Thanks for your patience.
Posted Nov 22, 2024 - 01:02 EST
Investigating
We're investigating an issue with dbt Cloud deployment runs failing due to errors related to PrivateLink connections. The team is working on a resolution, and we will provide updates at approximately 30-minute intervals or as soon as new information becomes available.
Posted Nov 22, 2024 - 00:05 EST
This incident affected: North America (N. Virginia) (Scheduled Jobs), Europe (Frankfurt) (Scheduled Jobs), Australia (Sydney) (Scheduled Jobs), North America (AWS) Cell 1 (Scheduled Jobs), North America (AWS) Cell 2 (Scheduled Jobs), Europe (Azure) Cell 1 (Scheduled Jobs), North America (AWS) Cell 3 (Scheduled Jobs), and North America (Azure) Cell 1 (Scheduled Jobs).