Incident Summary
On Apr 14 we had a production degradation caused by the shared ops queue becoming saturated. This was not driven by an unusual traffic spike alone. The deeper issue was that a recent change introduced high-volume step-upload work into the hot path of the shared ops queue, so traffic patterns that had been tolerable before now created queue backlog, worker saturation, and customer-visible latency.
Timeline
09:32 UTC: jobs started accumulating in the queue and pending duration increased sharply.
09:32 UTC onward: writer workers saturated and latency-sensitive ops work started competing with a large volume of step-upload jobs.
11:30 UTC: main Redis cluster CPU reached 100%.
An EU team member declared an incident and escalated the issue to the NA team. The response was significantly delayed because notification settings on the on-call person's mobile device were inadequate.
15:00 UTC: autoscaling and cleanup reduced the backlog, but in-flight update requests expire after ~2 hours, so some customers experienced data loss.
Root Cause
The data ingestion pipeline generates a distinct task to process step-level data: a new task is created in the operations queue for every attempt with steps.
Those tasks are delayed and throttled by a fixed interval, which synchronized large batches into thundering herds instead of smoothing load.
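The synchronization effect can be illustrated with a minimal sketch (all names and the 300-second interval are hypothetical, not taken from our codebase): a fixed delay wakes every deferred task at the same instant, while random jitter spreads the same batch across the interval.

```python
import random

FIXED_DELAY_S = 300  # hypothetical fixed throttle interval

def fixed_delay(now: float) -> float:
    # Every task deferred in the same window wakes at the same time:
    # a thundering herd against the workers and Redis.
    return now + FIXED_DELAY_S

def jittered_delay(now: float) -> float:
    # Randomizing the delay spreads wake-ups across the interval
    # and smooths the load on the queue.
    return now + FIXED_DELAY_S * random.uniform(0.5, 1.5)

# 1,000 tasks enqueued at t=0: fixed delay produces a single spike,
# jitter spreads the wake-ups across a wide window.
fixed = {fixed_delay(0.0) for _ in range(1000)}
jittered = {jittered_delay(0.0) for _ in range(1000)}
```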
A single task type then became the dominant queue workload and starved more latency-sensitive ops tasks, spiking worker CPU to 100% and creating cascading effects on the rest of the system.
In short: the root cause was architectural. We moved a high-volume task into the shared hot path, and that queue was no longer reserved for the more time-sensitive work it previously handled.
Contributing Factors
During the last two weeks we significantly refactored our architecture and moved some components to long-running ECS jobs in order to reduce the number of connections.
We moved the step-processing task into the ops queue.
Writer capacity and autoscaling were not aggressive enough for this new workload shape.
We lacked early alerts on worker saturation.
Alerts
We already have alerts for the following infrastructure components:
Alert on ops queue waiting depth.
Alert on writer worker CPU saturation and sustained concurrency saturation.
Alert on Redis CPU saturation and latency.
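For the saturation alerts in particular, firing on a single sample would be noisy; the useful signal is a sustained breach. A minimal sketch of that condition (the 90% threshold and 5-sample window are illustrative assumptions, not our production values):

```python
def sustained_breach(samples: list[float], threshold: float, window: int) -> bool:
    # Fire only when the last `window` samples all exceed the threshold,
    # so a single spike does not page anyone but sustained saturation does.
    return len(samples) >= window and all(s > threshold for s in samples[-window:])

# e.g. writer CPU percent, alerting at >90% sustained over 5 samples:
# sustained_breach(cpu_samples, 90.0, 5)
```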
Escalation And Communication
The most impactful issue was inadequate escalation. Despite documented escalation procedures and policies, the notification settings on the on-call person's personal mobile device were silenced.
After the issue was resolved, Currents purchased a dedicated on-call SaaS with a mobile application that bypasses silenced notification settings. We tested it and trained all team members on how to use it.
Technical Follow-ups