dbt Cloud

Report a problemSubscribe to updates
Status pages / EMEA Cell 1 AZURE
Powered by
Privacy policy

·

Terms of service
Write-up
Bigquery connection jobs failing with error "Failed to generate profile due to incorrect credentials. Please double check credentials and try again."
Degraded performance
View the incident
Summary

On May 11, 2026, beginning at approximately 11:39am PDT, some customer accounts experienced failures across two surfaces of the dbt platform. All BigQuery v1 connection jobs in the affected accounts were cancelled before execution, and the Cloud UI crashed when users attempted to view or edit the connection profile assigned to an affected environment. This blocked impacted customers from running production jobs and from self-servicing a fix. The trigger was the completion of a progressive internal change rollout that switched credential reads from a legacy encrypted-blob storage path to a new key-value storage path. The new read path did not apply schema defaults to absent fields the way the legacy path did, which caused profile generation to fail for a small number of historical BigQuery v1 credential records whose stored shape relied on that defaulting behaviour. The issue was mitigated by reversing the rollout on our side and assisting impacted customers with generating new profiles via the API.

Impact

A small number of customer accounts were impacted across multiple regions and tenancy types. Within those accounts, customers experienced the following:

  • All BigQuery v1 connection jobs failed. Scheduled and manually triggered runs were cancelled before execution with the error: "Failed to generate profile due to incorrect credentials. Please double check credentials and try again." This blocked production data pipelines for the affected accounts.

  • The connection profile panel in the Cloud UI crashed. When users attempted to view or edit the connection profile assigned to an affected environment, the page returned a "Something went wrong" error caused by an underlying type error on a null credential field. This meant customers could not inspect, fix, or change credentials on the broken profile through the UI.

  • API-based reassignment was the only available self-service workaround. Customers could create a new profile with fresh credentials and reassign it to their environment via the API, but this was labor-intensive and required guidance from support.

Customers outside the affected handful of accounts were not impacted. For affected accounts, however, this was a complete outage of BigQuery v1 job execution and profile management, for the duration of the incident. We recognise the disruption this caused, particularly for customers running scheduled production workloads, and we apologise for the impact.

Root Cause

The root cause was a bug in how a newly rolled out credential read path constructed credential objects when fields were absent from the underlying storage. The new path substituted a null value for absent fields, whereas the legacy path filled missing fields with the dataclass default defined in code (for example, auth_type defaulting to "service-account-json"). Downstream profile generation then rejected the null, causing the failures customers observed.

The affected credentials were created in a valid shape under the API contract available at the time of their creation, and had been functioning correctly through the legacy read path. The legacy path applied schema defaults at read time for fields that customers had not explicitly set. The new read path did not apply those defaults, instead substituting a null value, which the downstream profile generation step then rejected. The customer credentials were unchanged throughout, we changed how the system reads them, and the change was not backwards compatible with valid credentials whose stored shape relied on read-time defaulting.

The safety mechanism we relied on to catch read-path inconsistencies shared the same assumption about field representation that the new read path was built on. Both behaved consistently with each other and inconsistently with the legacy path, so the safety check did not detect the regression. This is the lesson we are addressing in the remediation work.

Our pre-production environments contained typical credential data and did not include the historical records that triggered this failure. All automated tests passed, but they didn't cover this specific data shape.

Detection

This incident was detected when an affected customer contacted support, reporting that they could not make changes to a project profile and that their production jobs were failing. The incident was opened immediately, and a second affected customer reported the same symptoms shortly afterwards. Customer impact began at approximately 11:39am PDT, and the first report reached us some hours later. Factors that contributed to that delay: The volume of errors was very small relative to overall traffic and did not trigger our existing error-rate alerts. Only a small number of accounts were affected, so the signal was weak in aggregate metrics. Some affected customers run jobs on daily or less-frequent schedules, so failures only surfaced at the next scheduled run. The Cloud UI also crashed for affected profiles, which meant customers could not self-diagnose and went directly to support rather than working around the issue themselves.

Mitigation

The following actions were taken during the incident:

  • Manual workaround for the customer on the live call. While root cause was still under investigation, the team walked the customer through creating new profiles with fresh credentials and reassigning them to environments via the API. This unblocked their most critical environments but did not scale beyond the customer on the call.

  • Once we determined that the targeted segment-based rollback was not taking effect, we reverted the rollout globally. This immediately restored the legacy credential read path and resolved both the job failures and the UI errors for all affected customers.

  • A monitoring update was published to the dbt platform status page, and the incident was moved to resolved shortly after the reporting customers were verified to be operating normally.

  • A subsequent code change corrected the new read path so that absent fields are omitted (allowing schema defaults to apply) and updated the safety mechanism to compare against legacy values hydrated with defaults. Any future asymmetry between the two paths now routes safely to the legacy path.

Next Steps or Lessons Learned
Remediation or steps we have already taken
  • Reverted the rollout globally to restore the legacy credential read path.

  • Merged a code fix that corrects how the new read path handles absent fields and updates the safety mechanism so it no longer shares the same assumption as the new read path.

  • Verified that all known affected customers are operating normally on the legacy read path.

Planned Remediation
  • Validate gradual rollouts against production data shapes. For credential-related changes specifically, we are scoping a sampling check that exercises real production credential shapes through both the old and new code paths before a rollout completes, so any asymmetry surfaces before it reaches customers. This is the structural fix for this class of incident.

  • Audit production credential data shapes. The incident was caused by a data shape that test environments did not cover. An audit will identify other historical records that may be vulnerable to similar code-path changes, and will inform the validation phase of the ongoing credentials cleanup work.

  • Update the internal storage representation for the affected credentials so they carry schema defaults explicitly rather than relying on read-time defaulting. This is an internal change with no impact on customer configuration, and affected customers do not need to take any action.

  • Validate gradual rollout segment evaluation before relying on segments for rollback.

  • Update the internal incident-response guidelines so that for incidents where a recent rollout is the suspected cause and a global revert is reversible (for example, a feature flag), the default first response is a global revert rather than a targeted segment rollback.