At 2:21pm Eastern time, a change was made to the production Kubernetes cluster that powers dbt Cloud. This change caused application error rates to rise gradually over a 9-minute period. By 2:30pm, all requests to the dbt Cloud application were failing with a 503 Service Unavailable status code.
The change was related to permissions. In addition to taking down the application, it prevented our engineers from accessing the cluster. We worked to regain access to the cluster and remedy the root cause of the production outage. At 2:44pm Eastern time, a fix was implemented, and the application gradually became available again.
During the initial 9-minute partial outage and the subsequent 14-minute complete outage, web requests to the application failed with a 503 status code. Scheduled dbt runs that started or completed during this window may appear in a "cancelled" or "errored" state. Scheduled runs that were in progress but did not complete during the outage should be unaffected. IDE sessions that were in progress at the time of the outage should also be unaffected.
We are updating our procedures as a result of this outage to prevent this failure mode from recurring.
Posted Jun 16, 2020 - 16:26 EDT
A fix has been implemented and we are monitoring the results.
Posted Jun 16, 2020 - 14:51 EDT
The issue has been identified and a fix is being implemented.
Posted Jun 16, 2020 - 14:38 EDT
We are currently investigating this issue.
Posted Jun 16, 2020 - 14:35 EDT
This incident affected: Scheduled Jobs, API, and Web Application (cloud.getdbt.com).