An issue impacted scheduled job runs between 5:30 PM - 10:30 PM Eastern (8/28 21:30 - 8/29 2:30 UTC) on the US Multi-tenant instance of dbt Cloud. The AU and EMEA Multi-tenant instances were affected between 8/28 21:00 - 8/29 00:30 UTC.
During the outage, scheduled runs were stuck in a queued state and were automatically canceled after 30 minutes in that state.
We apologize to every customer affected by this outage. We take our responsibility very seriously, and we are making multiple improvements to our systems and processes both to prevent this kind of outage and to improve our recovery time.
Code change
A change was made to a fundamental part of our execution code, which caused runs to fail in an unexpected way and also prevented our fallbacks from picking up these runs. To collect run step logs, we execute an internal tool in a subprocess, which accepts a dictionary of environment variables. Because this is Python, the types in that dictionary are not enforced, and one of its values was set to an integer, which subprocess.Popen does not accept. This type of error would typically be caught by our manual testing and end-to-end test suite; however, due to configuration differences between our development and testing environments, it was missed. We have already remedied these environment differences.
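As a rough illustration (not our actual code), the sketch below reproduces this class of failure, assuming a hypothetical STEP_TIMEOUT variable and a POSIX /bin/echo standing in for our internal log-collection tool. subprocess.Popen only rejects the non-string value when the process is spawned, which is why the dictionary itself looks fine until runtime.

```python
import subprocess

# Minimal sketch: subprocess.Popen requires every environment value to be a
# string, but Python does not enforce that on the dictionary, so the mistake
# only surfaces when the subprocess is actually spawned.
env = {"STEP_TIMEOUT": 30}  # hypothetical variable; should have been "30"

try:
    subprocess.Popen(["/bin/echo", "collecting run step logs"], env=env)
except TypeError as exc:
    print(f"Popen rejected the environment: {exc}")  # e.g. "str expected, not int"

# Coercing values to strings before spawning avoids the failure.
safe_env = {key: str(value) for key, value in env.items()}
subprocess.Popen(["/bin/echo", "collecting run step logs"], env=safe_env).wait()
```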
Mitigation
Upon detecting the regression, we immediately moved to revert the change; however, several other issues delayed remediation. While we were attempting to deploy the fix, GitHub was experiencing an incident of its own. Our first attempt failed because GitHub did not register checks on our gitops repo for over 10 minutes, and our second attempt failed because GitHub Actions was stuck waiting for a runner. Once our checks ran, we were able to deploy the fix. After our environments picked up the new code, our AU and EMEA instances rolled out the new version; our US Multi-tenant instance, however, encountered an overwhelmed Kubernetes control plane and could not roll it out. We were eventually able to scale down the bad deployment and scale up the new version, which allowed runs to resume. At that point, the backlog of queued runs was too large to work through, so we cancelled runs more than 30 minutes old, allowing the platform to resume normal operations.
Immediate monitoring and mitigation - ETA: End of Sep’23