Currents - Runs are not being updated and timed out – Incident details

Runs are not being updated and timed out

Resolved
Major outage
Started 8 days ago
Lasted about 12 hours

Affected

API

Operational from 10:00 AM to 12:44 PM, Major outage from 12:44 PM to 10:21 PM

API - HTTP REST API

Operational from 10:00 AM to 12:44 PM, Major outage from 12:44 PM to 10:21 PM

API - Dashboard Browsing

Operational from 10:00 AM to 12:44 PM, Major outage from 12:44 PM to 10:21 PM

Data Ingestion

Operational from 10:00 AM to 12:44 PM, Major outage from 12:44 PM to 10:21 PM

Data Pipeline

Operational from 10:00 AM to 12:44 PM, Major outage from 12:44 PM to 10:21 PM

Updates
  • Resolved

    Our systems are back to normal. After AWS restored its services, we were able to resume processing incoming data. There is still a backlog of unprocessed events that accumulated during the outage.

    • Our focus is on processing the newly created runs without any delay.

    • Due to the nature of our service, there is less value in real-time processing of the delayed events, because the associated runs have already expired and timed out.

    • The backlog still needs to be post-processed for analytics and performance analysis.
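
As a rough illustration of the prioritization described in the points above, the sketch below splits incoming events into those whose runs are still live and those whose runs have already timed out. It is only a sketch under stated assumptions: the `Event` fields, the `triage` function, and the timeout value are hypothetical and do not describe our actual pipeline.

```python
from dataclasses import dataclass

# Hypothetical timeout; the real value is not stated in this report.
RUN_TIMEOUT_SECONDS = 3600


@dataclass
class Event:
    run_id: str
    run_started_at: float  # Unix timestamp at which the run was created
    payload: str


def triage(events, now, timeout=RUN_TIMEOUT_SECONDS):
    """Split events into (live, deferred) lists, preserving arrival order.

    live:     the run has not timed out, so process these without delay.
    deferred: the run has already timed out, so these can wait for
              post-processing (analytics and performance analysis).
    """
    live, deferred = [], []
    for event in events:
        expired = now - event.run_started_at > timeout
        (deferred if expired else live).append(event)
    return live, deferred
```

Under these assumptions, a worker would drain the `live` list first and work through `deferred` only when spare capacity is available.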

    This outage revealed a few performance- and resilience-related issues in our system. We will follow up with a more detailed analysis.

  • Identified

    Due to the AWS us-east-1 outage, we are still experiencing issues. There is a large backlog of data update tasks that we are unable to process because of a lack of EC2 resources.

    We are looking for alternative runtimes to restore the functionality.

    Relevant excerpt from AWS:


    [04:48 AM PDT] We continue to work to fully restore new EC2 launches in US-EAST-1. We recommend EC2 Instance launches that are not targeted to a specific Availability Zone (AZ) so that EC2 has flexibility in selecting the appropriate AZ. The impairment in new EC2 launches also affects services such as RDS, ECS, and Glue. We also recommend that Auto Scaling Groups are configured to use multiple AZs so that Auto Scaling can manage EC2 instance launches automatically.

  • Update

    Today at 01:00 PT, AWS experienced an outage in us-east-1.
    Most AWS services have been restored, yet we are still having trouble recovering all of our services.

  • Investigating

    Our auto-scaling is currently impacted by an AWS EC2 outage. We are waiting for new instances to be provisioned so that we can resume and complete all pending tasks.