dbt Cloud

Write-up
Runs delayed and CLI invocations timing out
Degraded performance
Summary

On February 4, 2026, from 11:20 UTC to 13:15 UTC, dbt Cloud experienced elevated latency to several external services in our US multi-cell production environment.

During that time, customers experienced scheduled runs starting later than expected and some CLI requests failing or timing out.

Our investigation determined that the issue was caused by AWS network congestion and packet loss. This impacted our communication with external services, including AWS KMS, GitHub, Metronome, Stripe, and Okta.

Impact
  • Customers in the affected US multi-cell production environment experienced delayed scheduled runs between 11:20 UTC and 13:15 UTC on February 4, 2026.

  • 4,377 of 37,490 total runs were delayed during the incident, approximately 11.7% of runs in the impacted environment.

  • Of the delayed runs, 67% were delayed by more than 2 minutes and 9.5% were delayed by more than 5 minutes.

  • Customers using CLI-related workflows encountered project configuration retrieval timeouts and HTTP 504 responses on affected requests; 892 unique users across 576 accounts were affected by these CLI timeouts.

  • The degradation was associated with elevated latency to several external services, including AWS KMS, GitHub, Metronome, Stripe, and Okta.

We apologize to every customer who was affected by this incident. We take our responsibility to our customers very seriously and are making multiple improvements to our systems to reduce the impact of this kind of degradation and to ensure that we can restore service more quickly in the event of a similar failure.

Root Cause

The root cause of the issue was an AWS network infrastructure event that affected external calls from dbt Cloud in our US multi-cell production environment.

What happened

  • During the incident, we investigated multiple hypotheses, but the strongest internal signals pointed to a widespread network problem rather than a failure in any single service.

  • We observed AWS KMS timeouts, multi-minute requests to retrieve git credentials in config-api, and elevated latency and timeout behavior across several external services, including AWS KMS, GitHub, Metronome, Stripe, and Okta. A minimal sketch of bounding such external calls appears after this list.

  • For customer-facing workflows, this manifested as delayed scheduled runs and CLI request failures, including project configuration retrieval timeouts and HTTP 504 responses on affected invocation requests.

  • AWS later confirmed that Transit Gateway packet forwarding nodes in us-east-1 experienced congestion and packet drops in a single availability zone due to oversubscription of network routers.

  • This increased latency for external requests made by dbt Cloud and led to the delayed runs and CLI timeout errors experienced by customers.
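
As a hedged illustration only (this is not dbt Cloud's actual code, and the function name and endpoint below are hypothetical), the sketch shows how bounding external calls with per-attempt timeouts and limited retries keeps a congested network path from turning into multi-minute hangs that stall downstream work:

```python
# Minimal sketch, assuming a generic external HTTP dependency (for example a
# git credentials endpoint). Names here are hypothetical, not dbt Cloud internals.
import time
import requests


def fetch_with_deadline(url: str, per_attempt_timeout_s: float = 5.0,
                        attempts: int = 3) -> requests.Response:
    """Call an external service, capping each attempt and backing off between retries."""
    last_exc = None
    for attempt in range(attempts):
        try:
            # (connect, read) timeouts keep one slow network hop from
            # consuming minutes on a single request.
            return requests.get(url, timeout=(per_attempt_timeout_s, per_attempt_timeout_s))
        except (requests.Timeout, requests.ConnectionError) as exc:
            last_exc = exc
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s before retrying
    raise TimeoutError(f"external call to {url} failed after {attempts} attempts") from last_exc
```

Failing fast in this way surfaces a network-level problem to retry and alerting layers instead of letting individual requests hang for several minutes.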

Mitigation

  • The customer-facing impact subsided at 13:15 UTC on February 4, 2026, before any dbt-side technical change was required.

  • AWS reported that its capacity management system responded during the incident by adding capacity, and that traffic was later rebalanced.

Next Steps or Lessons Learned
Actions already taken
  • AWS added capacity during the event and later rebalanced the affected network traffic.

  • AWS migrated the affected traffic to dedicated Transit Gateway routers in us-east-1 to improve service reliability.

  • We completed an internal review identifying gaps in our ability to quickly recognize network issues that appear as latency across multiple external services.
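
As a minimal sketch under assumed metric names (not our actual monitoring configuration), one way to recognize this pattern is to flag the case where several unrelated external dependencies degrade at the same time, which points at the shared network path rather than at any single provider:

```python
# Hypothetical sketch with assumed metric names, not dbt Cloud's monitoring stack.
from dataclasses import dataclass


@dataclass
class DependencyLatency:
    name: str
    p95_ms: float       # currently observed p95 latency
    baseline_ms: float  # normal p95 latency for this dependency


def looks_like_network_issue(deps: list[DependencyLatency],
                             degradation_factor: float = 3.0,
                             min_degraded: int = 3) -> bool:
    """Return True when enough independent dependencies are degraded at once."""
    degraded = [d for d in deps if d.p95_ms > d.baseline_ms * degradation_factor]
    return len(degraded) >= min_degraded


# Example: KMS, GitHub, and Okta all slowing simultaneously suggests a shared
# network path issue rather than a problem with any single upstream provider.
observations = [
    DependencyLatency("aws_kms", p95_ms=900.0, baseline_ms=40.0),
    DependencyLatency("github", p95_ms=2500.0, baseline_ms=300.0),
    DependencyLatency("okta", p95_ms=1200.0, baseline_ms=150.0),
]
assert looks_like_network_issue(observations)
```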

Questions?

If you have any questions about this incident or its impact on your account, please reach out to your dbt Labs account team or contact dbt Labs Support.