Data processing delayed

This incident has been resolved.

All systems are operational.

Impact Assessment

Customers' run results were significantly delayed for ~5 hours.
The delays caused further issues for services that depend on those results, such as webhooks and integrations.

Root Cause Analysis

At ~8:20pm PT (3:20am UTC), an internal cleanup task ran and removed old Docker images used to deploy our data processing service. An oversight in how this task was configured relative to our processes resulted in the Docker image of the currently deployed service being removed.
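
As an illustration of this failure mode, here is a minimal sketch of an age-based image cleanup that explicitly excludes any image still referenced by a running deployment. It is not the actual cleanup task: the `Image` type, the `images_to_delete` function, the 30-day retention window, and the example tags and dates are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Image:
    tag: str             # hypothetical image identifier
    pushed_at: datetime  # when the image was published


def images_to_delete(images, deployed_tags, now, max_age=timedelta(days=30)):
    """Return images older than max_age, never including a deployed one."""
    cutoff = now - max_age
    return [
        img for img in images
        if img.pushed_at < cutoff and img.tag not in deployed_tags
    ]


# Arbitrary example data, not taken from the incident.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
images = [
    Image("v1", now - timedelta(days=90)),  # old and unused
    Image("v2", now - timedelta(days=60)),  # old, but still deployed
    Image("v3", now - timedelta(days=5)),   # recent
]

stale = images_to_delete(images, deployed_tags={"v2"}, now=now)
print([img.tag for img in stale])  # prints ['v1']
```

A cleanup that filters on age alone would also select `v2` here, which is the situation described above: the image is old, but the running service still depends on it.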

Between ~4:00am PT (11:00am UTC) and ~9:00am PT (4:00pm UTC), the system could not complete any scaling action because it kept trying to deploy the deleted image. A small number of instances were still running and processing tasks, but not enough to handle the load.

Once we re-deployed a newer image, we were able to quickly recover to normal operating status.
