Affected
Partial outage from 1:00 PM to 5:47 PM
- Resolved
This incident has been resolved.
All systems are operational.
Impact Assessment
Customers' run results were significantly delayed for ~5 hours.
The delays caused further issues for services that depend on those results, such as webhooks and integrations.
Root cause analysis
At ~8:20pm PST (3:20am UTC) an internal cleanup task ran and removed old Docker images used to deploy our data processing service. Because of an oversight in how this task was configured relative to our deployment process, it also removed the Docker image of the currently deployed service.
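As an illustration only, a guard against this class of failure could look like the sketch below: the cleanup selects images by age but skips any tag that is still deployed. The names (`Image`, `images_to_delete`, `in_use_tags`), the image tags, and the 30-day retention window are hypothetical and do not describe the actual cleanup task.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Image:
    tag: str             # hypothetical image tag
    pushed_at: datetime  # when the image was published


def images_to_delete(images, in_use_tags, max_age, now):
    """Return images older than max_age, skipping any tag that is still deployed."""
    cutoff = now - max_age
    return [
        img for img in images
        if img.pushed_at < cutoff and img.tag not in in_use_tags
    ]


# Example data (assumed): v2 is old enough to be cleaned up but is still deployed.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
images = [
    Image("processor:v1", now - timedelta(days=90)),
    Image("processor:v2", now - timedelta(days=60)),
    Image("processor:v3", now - timedelta(days=2)),
]
stale = images_to_delete(
    images,
    in_use_tags={"processor:v2"},
    max_age=timedelta(days=30),
    now=now,
)
print([img.tag for img in stale])
```

With an empty `in_use_tags` set, the same call would also select `processor:v2`, which corresponds to the failure described above.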
Between ~4:00am PST (11:00am UTC) and ~9:00am PST (4:00pm UTC) the system could not complete any scaling action, because each attempt tried to deploy the deleted image. A small number of instances were still running and processing tasks, but not enough to handle the load.
Once we re-deployed a newer image, we were able to quickly recover to normal operating status.
- Monitoring
The service has now recovered, and we are monitoring for additional errors.
We experienced a scaling issue with our data processing service. This is delaying run results and run start handling. Some delays may have resulted in a failure to record runs.