Currents - Data processing delayed – Incident details

Data processing delayed

Resolved
Partial outage
Started about 2 months ago
Lasted about 5 hours

Affected

Data Pipeline

Partial outage from 1:00 PM to 5:47 PM

Updates
  • Resolved
    This incident has been resolved

    All systems are operational.

    Impact Assessment

    • Customers' run results were significantly delayed for ~5 hours.

    • The delays caused further issues in services that depend on those results, such as webhooks and integrations.

    Root cause analysis

    At ~8:20pm PST (3:20am UTC), an internal cleanup task ran and removed old Docker images used to deploy our data processing service. Due to an oversight in how this task was configured relative to our processes, it also removed the Docker image of the currently deployed service.

    Between ~4:00am PST (11:00am UTC) and ~9:00am PST (4:00pm UTC), the system could not complete any scaling action because it kept trying to deploy the deleted image. A small number of instances were still running and processing tasks, but not enough to handle the load.

    Once we re-deployed a newer image, we were able to quickly recover to normal operating status.
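
    As an illustration of the failure mode described above, here is a minimal sketch of an age-based cleanup that skips any image still in use. It is not our actual cleanup task: the `Image` type, the `select_images_to_delete` function, the tag names, and the 30-day retention window are all hypothetical, and the list of in-use tags is assumed to come from whatever system tracks the current deployment.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Image:
    """Hypothetical record for a stored image: its tag and when it was pushed."""
    tag: str
    pushed_at: datetime


def select_images_to_delete(images, in_use_tags, max_age, now):
    """Return tags of images older than max_age, skipping any tag still in use."""
    cutoff = now - max_age
    protected = set(in_use_tags)
    return [
        img.tag
        for img in images
        if img.pushed_at < cutoff and img.tag not in protected
    ]


# Example data (made up for illustration).
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
images = [
    Image("v1", now - timedelta(days=90)),  # old and unused
    Image("v2", now - timedelta(days=60)),  # old, but currently deployed
    Image("v3", now - timedelta(days=2)),   # recent
]

to_delete = select_images_to_delete(
    images, in_use_tags={"v2"}, max_age=timedelta(days=30), now=now
)
print(to_delete)  # ['v1']
```

    With an empty `in_use_tags` set, the same call would also select `v2`, which is the situation that leaves a scaling action with no image to deploy.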

  • Monitoring
    The service has now recovered, and we are monitoring for additional errors.

    We experienced a scaling issue with our data processing service. This delayed run results and run start handling. Some delays may have resulted in failure to record runs.