Currents - Slowdowns and high load – Incident details

Slowdowns and high load

Resolved
Major outage
Started 6 days ago. Lasted about 8 hours.

Affected

API - Reporting and Orchestration

Degraded performance from 8:24 PM to 11:15 PM, Major outage from 11:15 PM to 1:17 AM, Operational from 1:17 AM to 4:02 AM

Data Pipeline

Operational from 8:24 PM to 11:15 PM, Major outage from 11:15 PM to 1:17 AM, Operational from 1:17 AM to 4:02 AM

Updates
  • Resolved

    This incident has been resolved.

    The root cause was identified as a bug in the telemetry configuration for the background task queue.

    Adding telemetry data broke task deduplication, overloading the database with redundant write requests.
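
    The post does not detail the mechanism, but a minimal sketch of how this class of bug can arise, assuming (hypothetically) that the deduplication key is derived from the full task payload:

```python
import hashlib
import json
import time
import uuid

def dedup_key(task: dict) -> str:
    """Dedup key computed over the canonical JSON of the whole payload."""
    return hashlib.sha256(json.dumps(task, sort_keys=True).encode()).hexdigest()

def enqueue(task: dict, queue: list, seen: set, telemetry: bool = False) -> None:
    task = dict(task)
    if telemetry:
        # Hypothetical telemetry fields added on every enqueue. Because they
        # are unique each time, they leak into the dedup key: every task
        # looks new, and duplicates are queued instead of being dropped.
        task["trace_id"] = uuid.uuid4().hex
        task["enqueued_at"] = time.time()
    key = dedup_key(task)
    if key not in seen:
        seen.add(key)
        queue.append(task)  # each queued task becomes a DB write downstream

# Without telemetry: the duplicate is deduplicated away.
q, seen = [], set()
for _ in range(2):
    enqueue({"run_id": 42}, q, seen)
assert len(q) == 1

# With telemetry mixed into the payload: dedup silently fails,
# doubling the downstream write load.
q, seen = [], set()
for _ in range(2):
    enqueue({"run_id": 42}, q, seen, telemetry=True)
assert len(q) == 2
```

    A common fix is to compute the dedup key only over the business fields of the payload, keeping telemetry metadata out of the hash.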

  • Monitoring

    The system is now working normally again. During the incident, run creation may have failed, and run data from the incident window may be missing from the dashboard, resulting in timed-out runs and missing results.

    New runs have been working as expected as of Sat 11:30 PM UTC.

    We are still monitoring closely for further errors, and we will conduct a deeper investigation into the root cause on Monday.

    Several steps were taken during the incident: increasing our database capacity, canceling long-running tasks, and rolling back some of our most recently deployed services.

    It is not yet clear whether the rollback or the cancellation of long-running tasks resolved the incident. We will investigate the root cause further during regular business hours.

  • Identified

    We have identified a bottleneck in DB writes that caused slow processing and an accumulation of tasks in processing queues. Scaling up the DB cluster and terminating long-running operations restored system stability.

    We are investigating the root cause of the increased DB resource consumption.

  • Investigating

    We are currently investigating this incident.

    We are seeing an elevated number of errors, as well as increased load on our database. Run processing has been significantly delayed.