Runs are not being updated and timed out

Updates

Resolved
October 20, 2025 at 10:21 PM UTC
Our systems are back to normal. After AWS restored its services, we were able to start processing incoming data again. There is still a backlog of unprocessed events that accumulated during the outage.
- Our focus is on processing newly created runs without any delay.
- Due to the nature of our service, there is less value in processing the delayed events in real time, because the associated runs have already expired and timed out.
- The backlog will still be post-processed for analytics and performance analysis.
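The prioritization described above can be illustrated with a minimal sketch. This is not our production code: the `Event` type, the `triage` function, and the one-hour `RUN_TIMEOUT` are hypothetical names and values chosen only to show the idea of separating events for runs that may still be active from events whose runs have already timed out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical cutoff: the real run timeout is not stated in this report.
RUN_TIMEOUT = timedelta(hours=1)


@dataclass
class Event:
    run_id: str
    run_created_at: datetime  # timezone-aware creation time of the run


def triage(events, now, run_timeout=RUN_TIMEOUT):
    """Split events into (realtime, deferred) by the age of their run."""
    realtime, deferred = [], []
    for event in events:
        if now - event.run_created_at < run_timeout:
            # The run may still be active, so process the event right away.
            realtime.append(event)
        else:
            # The run has already timed out, so keep the event for the
            # analytics backlog instead of real-time processing.
            deferred.append(event)
    return realtime, deferred


now = datetime(2025, 10, 20, 22, 21, tzinfo=timezone.utc)
events = [
    Event("fresh", now - timedelta(minutes=5)),
    Event("stale", now - timedelta(hours=12)),
]
realtime, deferred = triage(events, now)
```

In this sketch the deferred list is not dropped. It corresponds to the backlog that is still post-processed later for analytics and performance analysis.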
This outage revealed a few performance- and resilience-related issues with our system. We will follow up with a more detailed analysis.
Identified
October 20, 2025 at 12:44 PM UTC
Due to the AWS us-east-1 outage, we are still experiencing issues. There is a large backlog of data update tasks that we are unable to process because of a lack of EC2 resources.
We are looking for alternative runtimes to restore the functionality.
Relevant excerpt from AWS:

[04:48 AM PDT] We continue to work to fully restore new EC2 launches in US-EAST-1. We recommend EC2 Instance launches that are not targeted to a specific Availability Zone (AZ) so that EC2 has flexibility in selecting the appropriate AZ. The impairment in new EC2 launches also affects services such as RDS, ECS, and Glue. We also recommend that Auto Scaling Groups are configured to use multiple AZs so that Auto Scaling can manage EC2 instance launches automatically.
Update
October 20, 2025 at 10:00 AM UTC
Today at 01:00 PT, AWS experienced an outage in us-east-1.
Most AWS services have been restored, yet we are still having trouble recovering all of our services.
Investigating
October 20, 2025 at 10:00 AM UTC
Our auto-scaling is currently impacted by an AWS EC2 outage. We are waiting for new instances to be provisioned so that we can resume and complete all pending tasks.

Currents - Runs are not being updated and timed out – Incident details
