Currents - Timeouts and missing results for new runs – Incident details

Timeouts and missing results for new runs

Resolved
Major outage
Started 2 days ago
Lasted about 3 hours

Affected

Ingest and Orchestration

Partial outage from 5:11 PM to 6:46 PM, Degraded performance from 6:46 PM to 7:56 PM

API

Partial outage from 5:11 PM to 7:56 PM

API - HTTP REST API

Partial outage from 5:11 PM to 7:56 PM

API - Dashboard Browsing

Partial outage from 5:11 PM to 7:56 PM

Data Pipeline

Major outage from 5:11 PM to 6:46 PM, Operational from 6:46 PM to 7:56 PM

Updates
  • Resolved

    The system is back to normal.

    Initial Technical Analysis

    • A sudden increase in ingress traffic caused one of the OLAP databases to throttle new write and read requests.

    • The throttling caused a backlog of tasks to accumulate. Autoscaling kicked in but could not keep up with the backlog.

    • Increasing the infrastructure capacity resolved the issue.
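
The dynamic described above can be illustrated with a toy queue model. This is only a sketch under stated assumptions, not the actual system: the function name, the step-wise autoscaler, and all numbers are hypothetical and chosen for illustration. It shows how a backlog keeps growing when scaled-up capacity is capped below the spike rate, and how it drains once the capacity ceiling is raised.

```python
def simulate_backlog(arrivals, base_capacity, scale_step, max_capacity):
    """Return a list of (backlog, capacity) pairs, one per tick.

    Hypothetical model: each tick, `arriving` tasks come in and up to
    `capacity` tasks are processed; unprocessed tasks carry over.
    """
    backlog = 0
    capacity = base_capacity
    history = []
    for arriving in arrivals:
        # Work that could not be processed this tick stays in the backlog.
        backlog = max(0, backlog + arriving - capacity)
        history.append((backlog, capacity))
        # Assumed autoscaler: add a fixed step while a backlog exists,
        # never exceeding the configured ceiling.
        if backlog > 0 and capacity < max_capacity:
            capacity = min(max_capacity, capacity + scale_step)
    return history


# Illustrative traffic only: steady load, a five-tick spike, then steady load.
arrivals = [100] * 3 + [300] * 5 + [100] * 4

# Ceiling below the spike rate: the backlog outlasts the spike.
capped = simulate_backlog(arrivals, base_capacity=120, scale_step=20, max_capacity=200)

# Higher ceiling and larger step: the backlog drains during the spike.
raised = simulate_backlog(arrivals, base_capacity=120, scale_step=100, max_capacity=500)

print("capped final backlog:", capped[-1][0])
print("raised final backlog:", raised[-1][0])
```

In this model the autoscaler reacts in both runs. What differs is whether the capacity ceiling sits above the spike rate.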

    Impact

    The issue affected test results reported on April 25 between approximately 4:30 PM and 7:30 PM GMT:

    • Runs could be marked as timed out.

    • Test results were missing or delayed for affected runs.

    • CI jobs displayed warnings and connectivity errors.

  • Identified

    The issue is caused by increased memory pressure on an infrastructure component. We are adjusting its capacity and processing the pending tasks.

  • Investigating
    We are currently investigating this incident.