Slowdowns and high load

Updates

Resolved
January 12, 2025 at 4:02 AM UTC
This incident has been resolved.
The root cause was identified as a bug in the telemetry configuration for the background task queue.
Adding telemetry data broke task deduplication, which overloaded the DB with redundant write requests.
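
To illustrate the failure mode described above, here is a minimal sketch of how extra telemetry data can defeat deduplication. It is not the actual implementation. It assumes, hypothetically, that the deduplication key is a hash of the task payload and that telemetry is attached as a `trace_id` field. The names `dedup_key`, `enqueue`, `update_run`, and `trace_id` are all illustrative.

```python
import hashlib
import json


def dedup_key(payload: dict, ignore: frozenset = frozenset()) -> str:
    """Hash the payload, skipping any keys listed in `ignore`."""
    stable = {k: v for k, v in payload.items() if k not in ignore}
    encoded = json.dumps(stable, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def enqueue(queue: dict, payload: dict, ignore: frozenset = frozenset()) -> bool:
    """Add the task unless one with the same key is already queued."""
    key = dedup_key(payload, ignore)
    if key in queue:
        return False  # duplicate: no extra DB write would be scheduled
    queue[key] = payload
    return True


# Two requests for the same work, differing only in hypothetical telemetry data.
first = {"task": "update_run", "run_id": "r1", "trace_id": "a1"}
second = {"task": "update_run", "run_id": "r1", "trace_id": "b2"}

# Key computed over the whole payload: the telemetry field makes every task unique.
naive_queue: dict = {}
naive_results = [enqueue(naive_queue, p) for p in (first, second)]

# Key computed without telemetry fields: the second task is recognized as a duplicate.
fixed_queue: dict = {}
telemetry_fields = frozenset({"trace_id"})
fixed_results = [enqueue(fixed_queue, p, telemetry_fields) for p in (first, second)]
```

Under these assumptions, the first queue accepts both tasks and would issue two writes for the same run, while the second queue accepts only one.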
Monitoring
January 12, 2025 at 1:17 AM UTC
The system is working normally again. During the incident, run creation may have failed, and run data from the incident window may be missing from the dashboard, leading to timed-out runs and missing results.
New runs should be working as expected as of Saturday 11:30 PM UTC.
We are still closely monitoring for further errors, and we will investigate the root cause in more depth on Monday.
During the incident we took several steps: increasing our database capacity, canceling long-running tasks, and rolling back some of our most recently deployed services.
It is not yet clear whether the rollback or the canceling of long-running tasks resolved the incident. We will look deeper into the root cause during regular business hours.
Identified
January 11, 2025 at 11:15 PM UTC
We have identified a bottleneck in DB writes that caused slow processing and an accumulation of tasks in the processing queues. Scaling up the DB cluster and terminating long-running operations restored system stability.
We are investigating the root cause of the increase in DB resource consumption.
Investigating
January 11, 2025 at 8:24 PM UTC
We are currently investigating this incident.
We are seeing an elevated number of errors and increased load on our database. Run processing has been significantly delayed.

Currents - Slowdowns and high load – Incident details
