Currents - Slowness in data reporting and ingestion – Incident details

Slowness in data reporting and ingestion

Resolved
Major outage
Started 7 days ago · Lasted about 6 hours

Affected

Data Ingestion

Operational from 9:30 AM to 11:30 AM, Major outage from 11:30 AM to 3:32 PM

Data Pipeline

Major outage from 9:30 AM to 3:32 PM

Updates
  • Postmortem

    Incident Summary

    On Apr 14 we had a production degradation caused by the shared ops queue becoming saturated. This was not driven by an unusual traffic spike alone. The deeper issue was that a recent change introduced high-volume step-upload work into the hot path of the shared ops queue, so traffic patterns that had been tolerable before now created queue backlog, worker saturation, and customer-visible latency.

    Timeline

    • 09:32 UTC: jobs started accumulating in the queue and pending duration increased sharply.

    • 09:32 UTC onward: writer workers saturated and latency-sensitive ops work started competing with a large volume of step-upload jobs.

    • 11:30 UTC: main Redis cluster CPU reached 100%.

    • An EU team member declared an incident and escalated it to the NA team. The response was significantly delayed because notifications were not adequately configured on the on-call engineer's mobile device.

    • 15:00 UTC: autoscaling and cleanup reduced the backlog, but in-flight update requests expire after ~2 hours, so some customers experienced data loss.

    Root Cause

    • The data ingestion pipeline generates a distinct task to process step-level data: a new task is created in the operations queue for every attempt that has steps.

    • Those tasks are delayed and throttled by a fixed interval, which synchronized large batches into thundering herds instead of smoothing the load (see the sketch below).

    • A single task type then became the dominant queue workload and starved the more latency-sensitive ops tasks, spiking worker CPU to 100% and creating cascading effects on the rest of the system.

    In short: the root cause was architectural. We moved a high-volume task into the shared hot path, and that queue was no longer reserved for the more time-sensitive work it previously handled.
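
    To illustrate the thundering-herd effect, here is a minimal BullMQ sketch. The queue name, job name, delay value, and jitter window are assumptions for illustration, not the actual Currents code.

    ```ts
    // Hypothetical illustration only: names and values are assumptions.
    import { Queue } from "bullmq";

    const opsQueue = new Queue("ops", { connection: { host: "localhost", port: 6379 } });

    // Before: every attempt with steps enqueued a job with the same fixed delay.
    // A burst of N uploads therefore became N jobs that all turned runnable at
    // the same instant, saturating the writer workers at once.
    async function enqueueStepProcessingFixed(attemptId: string) {
      await opsQueue.add("process-steps", { attemptId }, { delay: 60_000 });
    }

    // One way to smooth the load: add random jitter so the same burst is spread
    // over a window instead of landing on the workers simultaneously.
    async function enqueueStepProcessingJittered(attemptId: string) {
      const jitterMs = Math.floor(Math.random() * 60_000); // spread over ~1 minute
      await opsQueue.add("process-steps", { attemptId }, { delay: 60_000 + jitterMs });
    }
    ```

    Jitter alone would not have prevented the incident; as noted above, the main fix was moving this high-volume task off the shared hot path entirely (see Technical Follow-ups).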

    Contributing Factors

    • During the last two weeks we significantly refactored our architecture and moved some components to long-running ECS jobs in order to reduce the number of connections.

    • We moved the step-processing task into the shared ops queue.

    • Writer capacity and autoscaling were not aggressive enough for this new workload shape.

    • We lacked early alerts on worker saturation.

    Alerts

    We already have alerts for the following infrastructure components:

    • Alert on ops queue waiting depth (a minimal monitoring sketch follows this list).

    • Alert on writer worker CPU saturation and sustained concurrency saturation.

    • Alert on Redis CPU saturation and latency.
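
    As an illustration of the queue-depth alert, here is a minimal sketch that polls BullMQ for the number of waiting jobs; the threshold, polling interval, and sendAlert() helper are hypothetical placeholders, not our actual monitoring stack.

    ```ts
    // Hypothetical sketch: threshold, interval, and alert delivery are assumptions.
    import { Queue } from "bullmq";

    const opsQueue = new Queue("ops", { connection: { host: "localhost", port: 6379 } });
    const WAITING_DEPTH_THRESHOLD = 1_000; // illustrative threshold

    // Placeholder for whatever pager or alerting integration is actually used.
    async function sendAlert(message: string): Promise<void> {
      console.error(`[ALERT] ${message}`);
    }

    async function checkOpsQueueDepth(): Promise<void> {
      const waiting = await opsQueue.getWaitingCount();
      if (waiting > WAITING_DEPTH_THRESHOLD) {
        await sendAlert(`ops queue waiting depth is ${waiting}`);
      }
    }

    // Poll periodically, e.g. from a small scheduled task.
    setInterval(() => void checkOpsQueueDepth(), 30_000);
    ```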

    Escalation And Communication

    The most impactful issue was inadequate escalation. Despite documented escalation procedures and policies, notifications on the on-call person's personal mobile device were silenced.

    After the issue was resolved, Currents purchased a dedicated on-call SaaS with a mobile application that bypasses silenced notification settings. We tested it and trained all team members on how to use it.

    Technical Follow-ups

    • Remove the hot-path step-processing task from the shared ops queue (see the sketch after this list).

    • Add the alerts above with explicit thresholds and owners.
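
    A minimal sketch of the first follow-up, assuming BullMQ: give step processing its own queue and worker pool so that saturation there cannot starve latency-sensitive ops jobs. The queue name, concurrency value, and processStepUpload() are illustrative assumptions.

    ```ts
    // Hypothetical sketch: names and concurrency are assumptions for illustration.
    import { Queue, Worker } from "bullmq";

    const connection = { host: "localhost", port: 6379 };

    // Dedicated queue for bulk step uploads, separate from the shared "ops" queue.
    const stepsQueue = new Queue("steps", { connection });

    // A worker with its own concurrency budget; heavy step uploads back up here
    // without competing with latency-sensitive ops jobs.
    const stepsWorker = new Worker(
      "steps",
      async (job) => {
        await processStepUpload(job.data); // placeholder for the real processing logic
      },
      { connection, concurrency: 10 }
    );

    async function processStepUpload(data: unknown): Promise<void> {
      // ...step-level data processing would go here...
    }
    ```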

  • Resolved

    This incident has been resolved. A detailed post-mortem will follow.

  • Monitoring

    We are monitoring ingestion pipeline performance and stability before declaring that the incident is resolved.

  • Identified

    The issue was caused by a stalled BullMQ queue and slowness in the Redis cluster. We have restored operational capacity and are investigating the root cause.

  • Investigating
    We are currently investigating this incident.