Currents - Slowness in data reporting and ingestion – Incident details

Slowness in data reporting and ingestion

Resolved
Major outage
Started 7 days ago · Lasted about 6 hours

Affected

Data Ingestion

Operational from 9:30 AM to 11:30 AM, Major outage from 11:30 AM to 3:32 PM

Data Pipeline

Major outage from 9:30 AM to 3:32 PM

Updates
  • Postmortem

    Incident Summary

    On Apr 14 we had a production degradation caused by the shared ops queue becoming saturated. This was not driven by an unusual traffic spike alone. The deeper issue was that a recent change introduced high-volume step-upload work into the hot path of the shared ops queue, so traffic patterns that had been tolerable before now created queue backlog, worker saturation, and customer-visible latency.

    Timeline

    • 09:32 UTC: jobs started accumulating in the queue and pending duration increased sharply.

    • 09:32 UTC onward: writer workers saturated and latency-sensitive ops work started competing with a large volume of step-upload jobs.

    • 11:30 UTC: main Redis cluster CPU reached 100%.

    • An EU team member declared an incident and escalated it to the NA team. The response was significantly delayed because notifications were not adequately configured on the on-call engineer's mobile device.

    • 15:00 UTC: autoscaling and cleanup reduced the backlog, but in-flight update requests expire after ~2 hours, so some customers experienced data loss.

    Root Cause

    • The data ingestion pipeline generates a distinct task to process step-level data: a new task is created in the operations queue for every attempt that has steps.

    • Those tasks are delayed and throttled by a fixed interval, which synchronized large batches into thundering herds instead of smoothing the load (see the sketch below).

    • A single task type then became the dominant queue workload and starved the more latency-sensitive ops tasks, spiking worker CPU to 100% and creating cascading effects on the rest of the system.

    In short: the root cause was architectural. We moved a high-volume task into the shared hot path, and that queue was no longer reserved for the more time-sensitive work it previously handled.
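
    To illustrate the thundering-herd effect, here is a minimal BullMQ sketch. The queue name, job name, delay value, and jitter window are assumptions for illustration, not the actual Currents code.

    ```ts
    // Hypothetical illustration only: names and values are assumptions.
    import { Queue } from "bullmq";

    const opsQueue = new Queue("ops", { connection: { host: "localhost", port: 6379 } });

    // Before: every attempt with steps enqueued a job with the same fixed delay.
    // A burst of N uploads therefore became N jobs that all turned runnable at
    // the same instant, saturating the writer workers at once.
    async function enqueueStepProcessingFixed(attemptId: string) {
      await opsQueue.add("process-steps", { attemptId }, { delay: 60_000 });
    }

    // One way to smooth the load: add random jitter so the same burst is spread
    // over a window instead of landing on the workers simultaneously.
    async function enqueueStepProcessingJittered(attemptId: string) {
      const jitterMs = Math.floor(Math.random() * 60_000); // spread over ~1 minute
      await opsQueue.add("process-steps", { attemptId }, { delay: 60_000 + jitterMs });
    }
    ```

    Jitter alone would not have prevented the incident; as noted above, the main fix was moving this high-volume task off the shared hot path entirely (see Technical Follow-ups).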

    Contributing Factors

    • During the last two weeks we significantly refactored our architecture and moved some components to long-running ECS jobs in order to reduce the number of connections.

    • We moved the step-processing task into the shared ops queue.

    • Writer capacity and autoscaling were not aggressive enough for this new workload shape.

    • We lacked early alerts on worker saturation.

    Alerts

    We already have alerts for the following infrastructure components:

    • Alert on ops queue waiting depth (a minimal monitoring sketch follows this list).

    • Alert on writer worker CPU saturation and sustained concurrency saturation.

    • Alert on Redis CPU saturation and latency.
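
    As an illustration of the queue-depth alert, here is a minimal sketch that polls BullMQ for the number of waiting jobs; the threshold, polling interval, and sendAlert() helper are hypothetical placeholders, not our actual monitoring stack.

    ```ts
    // Hypothetical sketch: threshold, interval, and alert delivery are assumptions.
    import { Queue } from "bullmq";

    const opsQueue = new Queue("ops", { connection: { host: "localhost", port: 6379 } });
    const WAITING_DEPTH_THRESHOLD = 1_000; // illustrative threshold

    // Placeholder for whatever pager or alerting integration is actually used.
    async function sendAlert(message: string): Promise<void> {
      console.error(`[ALERT] ${message}`);
    }

    async function checkOpsQueueDepth(): Promise<void> {
      const waiting = await opsQueue.getWaitingCount();
      if (waiting > WAITING_DEPTH_THRESHOLD) {
        await sendAlert(`ops queue waiting depth is ${waiting}`);
      }
    }

    // Poll periodically, e.g. from a small scheduled task.
    setInterval(() => void checkOpsQueueDepth(), 30_000);
    ```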

    Escalation And Communication

    The most impactful issue was inadequate escalation. Despite documented escalation procedures and policies, notifications on the on-call person's personal mobile device were silenced.

    After the issue was resolved, Currents purchased a dedicated on-call SaaS with a mobile application that bypasses silenced notification settings. We tested it and trained all team members on how to use it.

    Technical Follow-ups

    • Remove the hot-path step-processing task from the shared ops queue (see the sketch after this list).

    • Add the alerts above with explicit thresholds and owners.
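
    A minimal sketch of the first follow-up, assuming BullMQ: give step processing its own queue and worker pool so that saturation there cannot starve latency-sensitive ops jobs. The queue name, concurrency value, and processStepUpload() are illustrative assumptions.

    ```ts
    // Hypothetical sketch: names and concurrency are assumptions for illustration.
    import { Queue, Worker } from "bullmq";

    const connection = { host: "localhost", port: 6379 };

    // Dedicated queue for bulk step uploads, separate from the shared "ops" queue.
    const stepsQueue = new Queue("steps", { connection });

    // A worker with its own concurrency budget; heavy step uploads back up here
    // without competing with latency-sensitive ops jobs.
    const stepsWorker = new Worker(
      "steps",
      async (job) => {
        await processStepUpload(job.data); // placeholder for the real processing logic
      },
      { connection, concurrency: 10 }
    );

    async function processStepUpload(data: unknown): Promise<void> {
      // ...step-level data processing would go here...
    }
    ```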

  • Resolved

    This incident has been resolved. A detailed post-mortem will follow.

  • Monitoring

    We are monitoring ingestion pipeline performance and stability before declaring that the incident is resolved.

  • Identified

    The issue was caused by a stalled BullMQ queue and slowness in the Redis cluster. We have restored operational capacity and are investigating the root cause.

  • Investigating
    We are currently investigating this incident.