dbt Cloud

Write-up
Scheduled Job Run Cancellations
Degraded performance
Summary

From 2026-02-20 19:58 UTC to 2026-02-22 03:14 UTC, customers on multi-tenant (MT) and multi-cell (MC) dbt Platform instances may have experienced disruptions to run execution, Studio sessions, and metadata processing. A release introduced a bug that caused a post-run internal telemetry process to hang indefinitely. These stuck telemetry invocations created system-wide resource pressure, which cascaded into failures across other services.

We mitigated the incident by disabling telemetry, rolling back the release, and running cleanup scripts to remove stuck resources. We have also identified improvements to reduce blast radius, improve detection, and prevent recurrence.

Impact

During the incident window, customers on MT/MC instances may have experienced one or more of the following, with the largest impact in prod-us-c1:

  • Failed runs

    • Most failures occurred when the system could not publish internal events after message broker connection limits were exceeded. These runs failed with Internal error before any steps executed.

    • A smaller number of runs failed when they could not fetch publication artifacts for cross-project references due to upstream instability. These runs failed with Error retrieving publications.

  • Canceled runs

    • Some runs were canceled after waiting 10 minutes to start due to capacity constraints. No steps executed for these runs.

    • Customers may also have seen delays in system-initiated cancellations, including automatic cancellation of over-queued jobs and custom timeouts.

  • Delayed runs

    • Some runs started late during periods of peak resource pressure.

  • Failed Studio sessions

    • During peak pressure, workers could not start, which prevented some Studio sessions from launching.

We sincerely apologize to affected customers. Reliable execution is critical, and we are prioritizing changes that reduce the blast radius and likelihood of incidents like this recurring.

Root Cause

Internal run telemetry entered an infinite recursion loop, but only for projects that:

  • use dbt_adapters/macros/utils/equals.sql

  • use any macro that calls adapter.behavior

Because telemetry had no timeout, stuck telemetry processes held system capacity for as long as a run could execute, up to 24 hours.
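The fix described later is a hard timeout on the telemetry call. As a minimal sketch (the actual implementation and timeout value are not in this write-up; `emit_with_timeout` and `TELEMETRY_TIMEOUT_SECONDS` are hypothetical names), a best-effort telemetry invocation can be bounded like this:

```python
import concurrent.futures
import time

# Assumed value for illustration; the actual limit chosen is not stated above.
TELEMETRY_TIMEOUT_SECONDS = 30

def emit_with_timeout(emit_fn, payload, timeout=TELEMETRY_TIMEOUT_SECONDS):
    """Run a telemetry call in a worker thread and give up after `timeout`
    seconds, so a hung invocation cannot hold run capacity indefinitely."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(emit_fn, payload)
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # Telemetry is best-effort: drop the event rather than block the run.
        return None
    finally:
        # Don't wait for a possibly hung worker when cleaning up.
        pool.shutdown(wait=False)
```

With a bound like this, the recursion bug would still have produced hung invocations, but each would have released its capacity after seconds instead of holding it for up to 24 hours.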

Stuck resources accumulated until we hit critical levels of:

  • Node pressure: Run workloads are scheduled into a dedicated run pool when possible, and otherwise into a shared pool that also hosts other platform services. Once the shared pool filled, other platform experiences were impacted.

  • Message broker connection exhaustion: Run workloads maintain broker connections while executing. As stuck resources increased, broker connections approached and exceeded limits, causing sustained connection errors and additional hangs.
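One standard way to keep a saturation like this from reaching the broker itself is to bound connections at the client side so excess demand fails fast locally. This is a generic sketch, not dbt's implementation; the class name and the limit value are assumptions:

```python
import threading

# Assumed cap for illustration; the real per-instance broker limit isn't stated.
BROKER_CONNECTION_LIMIT = 100

class BoundedBrokerPool:
    """Bound outstanding broker connections so accumulating stuck workloads
    fail fast at the pool instead of exhausting the broker itself."""

    def __init__(self, limit=BROKER_CONNECTION_LIMIT):
        self._slots = threading.BoundedSemaphore(limit)

    def connect(self, wait_seconds=5.0):
        # Block briefly for a free slot; refuse instead of hanging forever.
        if not self._slots.acquire(timeout=wait_seconds):
            raise ConnectionError("broker connection limit reached")
        # ...open the real broker connection here...

    def release(self):
        self._slots.release()
```

A client-side cap converts broker-wide connection errors into localized, retryable failures, which is one way to realize the "reduce blast radius" goal listed under Next Steps.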

Mitigation and Recovery

We mitigated the incident through a combination of rollback and operational cleanup:

  • Rolled back the telemetry process to its last stable release.

  • Executed cleanup scripts to terminate:

    • stuck run resources

    • pending resources that were backfilling capacity and preventing critical services from scheduling

  • Monitored recovery as resources normalized and run volume and success rates returned toward baseline.

  • Re-enabled telemetry once recovery indicators were stable.
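The cleanup scripts themselves are not shown in this write-up, but the core selection step can be sketched as an age filter against the maximum run duration noted above. The function name and data shape here are hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Matches the maximum run duration mentioned in the root cause section.
MAX_RUN_AGE = timedelta(hours=24)

def stuck_run_ids(runs, now=None):
    """Return IDs of runs that have exceeded the maximum run duration and are
    candidates for forced cleanup. `runs` maps run ID -> UTC start time."""
    now = now or datetime.now(timezone.utc)
    return sorted(
        run_id for run_id, started in runs.items()
        if now - started > MAX_RUN_AGE
    )
```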

Next Steps and Preventative Measures
Completed
  • Fixed the infinite recursion in internal telemetry

  • Added a timeout to internal telemetry

Planned
  • Add additional telemetry guardrails to better isolate failures

  • Reduce blast radius via workload isolation, so run-layer saturation does not impact other platform services

  • Improve monitoring and earlier detection for runaway resource growth, message broker connections, and instance capacity failures

  • Ensure critical dependent services retain sufficient capacity during abnormal traffic

  • Improve release process and test coverage for internal telemetry, including:

    • expanding telemetry test coverage

    • evaluating additional release gates for telemetry changes before broad promotion