<?xml version="1.0" encoding="UTF-8"?>
<feed xml:lang="en-US" xmlns="http://www.w3.org/2005/Atom">
  <id>tag:currents.instatus.com,2005:/history</id>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com"/>
  <link rel="self" type="application/atom+xml" href="https://currents.instatus.com/history.atom"/>
  <title>Currents Status - Incident history</title>
  <updated>2026-02-11T17:11:35.252+00:00</updated>
  <author>
    <name>Currents</name>
  </author>
  
<entry>
  <id>tag:currents.instatus.com,2005:Incident/cmliaec6k0699sdts5k3wa8lp</id>
  <published>2026-02-11T17:11:35.252+00:00</published>
  <updated>2026-02-11T19:41:12.274+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cmliaec6k0699sdts5k3wa8lp"/>
  <title>Issues with slow processing of results</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 5 hours and 11 minutes</p>
    <p><strong>Affected Components:</strong> Data Pipeline, API - HTTP REST API, Playwright Integration, Scheduler, Data Ingestion, Cypress Integration, API - Dashboard Browsing, 3rd Party Integrations</p>
    <p><small>Feb <var data-var='date'> 11</var>, <var data-var='time'>17:11:35</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident. Processing of results has fallen significantly behind.

We are also seeing an increase in errors in the endpoints that receive client requests from the reporters.</p>
<p><small>Feb <var data-var='date'> 11</var>, <var data-var='time'>17:38:30</var> GMT+0</small><br /><strong>Identified</strong> -
  We are continuing to work on recovery. The endpoint errors have been resolved, but our queues are backed up, so results are taking a while to show in the dashboard.</p>
<p><small>Feb <var data-var='date'> 11</var>, <var data-var='time'>19:41:12</var> GMT+0</small><br /><strong>Identified</strong> -
  We are continuing work on full recovery. Initial results processing and notifications are back to normal, but some data queues that populate the dashboard charts and explorers are still delayed.</p>
<p><small>Feb <var data-var='date'> 11</var>, <var data-var='time'>22:15:58</var> GMT+0</small><br /><strong>Monitoring</strong> -
  We are almost fully recovered, though we are still seeing a 10-minute delay on the analytics queues that power some of the reports in the dashboard. These should finish catching up over the next half hour, with the delay improving steadily.</p>
<p><small>Feb <var data-var='date'> 11</var>, <var data-var='time'>22:22:54</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved. All queues are caught up.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/cmj7du1b608aofsttwti2cz86</id>
  <published>2025-12-15T16:42:53.649+00:00</published>
  <updated>2025-12-15T17:23:24.898+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cmj7du1b608aofsttwti2cz86"/>
  <title>Slow reported run completions</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 17 minutes</p>
    <p><strong>Affected Components:</strong> Data Pipeline</p>
    <p><small>Dec <var data-var='date'> 15</var>, <var data-var='time'>16:42:53</var> GMT+0</small><br /><strong>Identified</strong> -
  Our background task queues have fallen behind; as a result, run statuses in the dashboard and event notifications are currently delayed.

This was due to a low auto-scaling cap put in place during our last incident (to limit database connections). The scaling issue has now been resolved, but we expect the system to take some time to catch up, and result notifications will likely continue to be delayed for the next hour or two.</p>
<p><small>Dec <var data-var='date'> 15</var>, <var data-var='time'>17:23:24</var> GMT+0</small><br /><strong>Monitoring</strong> -
  The queue has now recovered; we will monitor the change as the extra processes we launched ramp back down to normal levels.</p>
<p><small>Dec <var data-var='date'> 15</var>, <var data-var='time'>18:00:16</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/cmj3sv0he0cuw2jhx9hw6ge34</id>
  <published>2025-12-13T02:30:00.000+00:00</published>
  <updated>2025-12-13T02:30:00.000+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cmj3sv0he0cuw2jhx9hw6ge34"/>
  <title>Slowness in test results ingestion rate, runs timing out</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 2 hours and 50 minutes</p>
    <p><strong>Affected Components:</strong> Data Pipeline, Playwright Integration, Data Ingestion</p>
    <p><small>Dec <var data-var='date'> 13</var>, <var data-var='time'>02:30:00</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Dec <var data-var='date'> 13</var>, <var data-var='time'>05:20:09</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved. The preliminary root cause is a sudden spike in concurrent executions, which affected DB capacity. After increasing capacity and processing the backlog, the system is back to normal. A more thorough investigation and mitigation will follow to improve system stability under surges in concurrent requests.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/cmgz3lffb00k3135qod5do6bc</id>
  <published>2025-10-20T10:00:00.000+00:00</published>
  <updated>2025-10-20T10:00:00.000+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cmgz3lffb00k3135qod5do6bc"/>
  <title>Runs are not being updated and timed out</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 12 hours and 22 minutes</p>
    <p><strong>Affected Components:</strong> Data Pipeline, API - HTTP REST API, Data Ingestion, API - Dashboard Browsing</p>
    <p><small>Oct <var data-var='date'> 20</var>, <var data-var='time'>10:00:00</var> GMT+0</small><br /><strong>Investigating</strong> -
  Today at 01:00 PT, AWS experienced an outage (us-east-1).
Most AWS services have been restored, yet we are still facing issues recovering all of our services.</p>
<p><small>Oct <var data-var='date'> 20</var>, <var data-var='time'>10:00:00</var> GMT+0</small><br /><strong>Investigating</strong> -
  Our auto-scaling is currently impacted by an AWS EC2 outage. We are waiting for new instances to be provisioned in order to resume and complete all pending tasks.</p>
<p><small>Oct <var data-var='date'> 20</var>, <var data-var='time'>12:44:55</var> GMT+0</small><br /><strong>Identified</strong> -
  Due to the AWS us-east-1 outage we are still experiencing issues. There is a large backlog of data update tasks that we are unable to process because of the lack of EC2 resources.

We are looking for alternative runtimes to restore the functionality.

Relevant excerpt from AWS:

[04:48 AM PDT] We continue to work to fully restore new EC2 launches in US-EAST-1. We recommend EC2 Instance launches that are not targeted to a specific Availability Zone (AZ) so that EC2 has flexibility in selecting the appropriate AZ. The impairment in new EC2 launches also affects services such as RDS, ECS, and Glue. We also recommend that Auto Scaling Groups are configured to use multiple AZs so that Auto Scaling can manage EC2 instance launches automatically.</p>
<p><small>Oct <var data-var='date'> 20</var>, <var data-var='time'>22:21:31</var> GMT+0</small><br /><strong>Resolved</strong> -
  Our systems are back to normal. After AWS restored their services we were able to start processing the incoming data. There&#039;s still a backlog of non-processed events accumulated during the outage.

* Our focus is on processing newly created runs without any delay.
* Due to the nature of our service, there&#039;s less value in real-time processing of the delayed events, because the associated runs have already expired and timed out.
* The backlog still needs to be post-processed for analytics and performance analysis.

This outage revealed a few performance- and resilience-related issues with our system. We will follow up with a more detailed analysis.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/cmelqcp9x0007cdjo89tuc9iu</id>
  <published>2025-08-21T18:23:28.377+00:00</published>
  <updated>2025-08-21T18:23:28.377+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cmelqcp9x0007cdjo89tuc9iu"/>
  <title>Performance issues for Cypress uploads and users on the US east coast</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 2 hours and 1 minute</p>
    <p><strong>Affected Components:</strong> Data Ingestion</p>
    <p><small>Aug <var data-var='date'> 21</var>, <var data-var='time'>18:23:28</var> GMT+0</small><br /><strong>Identified</strong> -
  Due to a network issue, uploads may be slow. Cypress log uploads in all regions are impacted (our ingest server for these is in the affected area), as are Playwright artifact uploads from the US east coast.</p>
<p><small>Aug <var data-var='date'> 21</var>, <var data-var='time'>19:06:36</var> GMT+0</small><br /><strong>Monitoring</strong> -
  The issue appears to have been resolved. We are continuing to monitor and confirm.</p>
<p><small>Aug <var data-var='date'> 21</var>, <var data-var='time'>20:24:18</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/cmeh9yhli000vtpndvhzjnov2</id>
  <published>2025-08-18T15:33:32.503+00:00</published>
  <updated>2025-08-18T15:33:32.503+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cmeh9yhli000vtpndvhzjnov2"/>
  <title>Increased errors and timeouts for reporting runs and uploading artifacts</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 2 hours and 43 minutes</p>
    <p><strong>Affected Components:</strong> Data Ingestion</p>
    <p><small>Aug <var data-var='date'> 18</var>, <var data-var='time'>15:33:32</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Aug <var data-var='date'> 18</var>, <var data-var='time'>16:46:15</var> GMT+0</small><br /><strong>Monitoring</strong> -
  We are no longer experiencing the issue, but are still investigating the cause.</p>
<p><small>Aug <var data-var='date'> 18</var>, <var data-var='time'>18:17:00</var> GMT+0</small><br /><strong>Resolved</strong> -
  Incident resolved.

We had a partial outage with our object storage provider that resulted in some uploads and requests timing out.

We also had issues with our retry logic triggering access limits. We will look into improving this area.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/cmapr3njg000inqbdo4jjmhvm</id>
  <published>2025-05-15T19:05:00.000+00:00</published>
  <updated>2025-05-15T19:17:38.860+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cmapr3njg000inqbdo4jjmhvm"/>
  <title>Failure to upload artifacts</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 13 minutes</p>
    <p><strong>Affected Components:</strong> Playwright Integration, Data Ingestion</p>
    <p><small>May <var data-var='date'> 15</var>, <var data-var='time'>19:17:38</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>
<p><small>May <var data-var='date'> 15</var>, <var data-var='time'>19:05:00</var> GMT+0</small><br /><strong>Identified</strong> -
  A failed deployment caused a 10-minute outage.

During the outage, test reporters were not able to upload artifacts.

After receiving the alerts, we rolled back the failed deployment and the system returned to normal.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/cmag1at3l001dtz1nglwaxh0o</id>
  <published>2025-05-09T00:04:35.902+00:00</published>
  <updated>2025-05-09T00:04:35.902+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cmag1at3l001dtz1nglwaxh0o"/>
  <title>Timeouts and missing results for new runs</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 4 minutes</p>
    <p><strong>Affected Components:</strong> API - Dashboard Browsing, API - HTTP REST API</p>
    <p><small>May <var data-var='date'> 9</var>, <var data-var='time'>00:04:35</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident: our Elasticsearch cluster has become unhealthy and is impacting the processing of results.</p>
<p><small>May <var data-var='date'> 9</var>, <var data-var='time'>00:59:23</var> GMT+0</small><br /><strong>Monitoring</strong> -
  We implemented a mitigation and are currently monitoring the result while continuing to investigate the cause.</p>
<p><small>May <var data-var='date'> 9</var>, <var data-var='time'>01:08:08</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/cma1c55bt0011r0rsjs9m64y1</id>
  <published>2025-04-28T17:11:34.433+00:00</published>
  <updated>2025-04-28T17:11:34.433+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cma1c55bt0011r0rsjs9m64y1"/>
  <title>Timeouts and missing results for new runs</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 2 hours and 45 minutes</p>
    <p><strong>Affected Components:</strong> Data Pipeline, Data Ingestion, API - HTTP REST API, API - Dashboard Browsing</p>
    <p><small>Apr <var data-var='date'> 28</var>, <var data-var='time'>17:11:34</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Apr <var data-var='date'> 28</var>, <var data-var='time'>18:46:38</var> GMT+0</small><br /><strong>Identified</strong> -
  The issue is caused by increased memory pressure on an infrastructure component. We are adjusting its capacity and taking care of the pending tasks.</p>
<p><small>Apr <var data-var='date'> 28</var>, <var data-var='time'>19:56:08</var> GMT+0</small><br /><strong>Resolved</strong> -
  The system is back to normal.

### Initial Technical Analysis

* A sudden increase in ingress traffic caused one of our OLAP databases to throttle new write and read requests
* That caused an accumulation of backlog tasks; autoscaling kicked in but couldn&#039;t cope with the backlog
* Increasing the infrastructure capacity resolved the issue

### Impact

The issue affected test results reported on April 28, ~4:30pm-7:30pm GMT

* runs may have been marked as timed out
* test result reporting is missing or delayed for affected runs
* CI jobs displayed warnings and connectivity errors</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/cm8j28qzk006icafbpxypt8uv</id>
  <published>2025-03-21T14:30:00.000+00:00</published>
  <updated>2025-03-21T17:30:00.000+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cm8j28qzk006icafbpxypt8uv"/>
  <title>Rest API endpoint recovery</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 3 hours</p>
    <p><strong>Affected Components:</strong> API - HTTP REST API</p>
    <p><small>Mar <var data-var='date'> 21</var>, <var data-var='time'>17:30:00</var> GMT+0</small><br /><strong>Resolved</strong> -
  The error has been resolved and the proper configuration restored. An internal issue has been logged to review the incident and look for future deployment improvements.</p>
<p><small>Mar <var data-var='date'> 21</var>, <var data-var='time'>14:30:00</var> GMT+0</small><br /><strong>Monitoring</strong> -
  Some incompatible settings from a failed deployment were left in place after a rollback, and our REST API endpoint has been unreachable for the past three hours.

Spot instance reset requests and direct calls to our API endpoints were impacted.

We implemented a fix and are currently monitoring the result.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/cm5smxif200gz903c98och9sv</id>
  <published>2025-01-11T20:24:48.991+00:00</published>
  <updated>2025-01-11T20:24:48.991+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cm5smxif200gz903c98och9sv"/>
  <title>Slowdowns and high load</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 7 hours and 38 minutes</p>
    <p><strong>Affected Components:</strong> Data Pipeline, Data Ingestion</p>
    <p><small>Jan <var data-var='date'> 11</var>, <var data-var='time'>20:24:48</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.

We are seeing an elevated number of errors, as well as increased load on our database. Run processing has been significantly delayed.</p>
<p><small>Jan <var data-var='date'> 11</var>, <var data-var='time'>23:15:30</var> GMT+0</small><br /><strong>Identified</strong> -
  We have identified a bottleneck in DB writes that caused slow processing and an accumulation of tasks in the processing queues. Scaling up the DB cluster and terminating long-running operations restored system stability.

We are investigating the root cause of the increase in DB resource consumption.</p>
<p><small>Jan <var data-var='date'> 12</var>, <var data-var='time'>01:17:32</var> GMT+0</small><br /><strong>Monitoring</strong> -
  The system is once again working as normal. During the incident you may have failed to create runs, and run data from the incident window may be missing in the dashboard, leading to timed-out runs and missing results.

New runs should be working as expected as of Sat 11:30pm UTC.

We are still closely monitoring for further errors, and we will do a deeper investigation into the root cause on Monday.

Several steps were taken during the incident: increasing our database capacity, canceling long-running tasks, and rolling back some of our most recently deployed services.

It&#039;s not yet clear whether the rollback or the cancellation of long-running tasks resolved the incident. We will look into the root cause more deeply during regular business hours.</p>
<p><small>Jan <var data-var='date'> 12</var>, <var data-var='time'>04:02:50</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.

The root cause was identified as a bug in the telemetry configuration for the background tasks queue.

Adding telemetry data broke task deduplication, causing DB overload due to redundant write requests.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/cm4vkjgjh002klvmk6xvqujo9</id>
  <published>2024-12-19T17:01:30.608+00:00</published>
  <updated>2024-12-19T17:01:30.608+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cm4vkjgjh002klvmk6xvqujo9"/>
  <title>An increase in errors and delays from our services</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 3 hours and 1 minute</p>
    <p><strong>Affected Components:</strong> API - HTTP REST API, API - Dashboard Browsing, Data Ingestion</p>
    <p><small>Dec <var data-var='date'> 19</var>, <var data-var='time'>17:01:30</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Dec <var data-var='date'> 19</var>, <var data-var='time'>18:01:44</var> GMT+0</small><br /><strong>Investigating</strong> -
  We have reverted the most recent release, and the error rates have gone down.

We are still experiencing some slowdowns due to the task backlog that grew during the incident, but the system is mostly recovered.

Runs that were created during the outage will be marked as timed out, and you may see errors related to those runs in your local clients if they are still in progress.

New runs should not be affected, and should now work as expected.

We are still actively investigating the root cause.</p>
<p><small>Dec <var data-var='date'> 19</var>, <var data-var='time'>18:23:31</var> GMT+0</small><br /><strong>Identified</strong> -
  We are still seeing issues as a result of the backlog and are scaling up to address it.

We are continuing to investigate the cause in the meantime.</p>
<p><small>Dec <var data-var='date'> 19</var>, <var data-var='time'>19:05:54</var> GMT+0</small><br /><strong>Monitoring</strong> -
  Our backlog has recovered, and the service is back to normal.

The root cause appears to have been changes in the now-reverted deployment. We will continue to investigate and provide more details after our investigation.</p>
<p><small>Dec <var data-var='date'> 19</var>, <var data-var='time'>20:02:32</var> GMT+0</small><br /><strong>Resolved</strong> -
  The system has recovered. We will continue to investigate the root cause and will update the incident description with details when we have them.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/cm2umsoyg0019c7a98i9p2kj4</id>
  <published>2024-10-29T13:00:00.000+00:00</published>
  <updated>2024-10-29T13:00:00.000+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cm2umsoyg0019c7a98i9p2kj4"/>
  <title>Data processing delayed</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 4 hours and 48 minutes</p>
    <p><strong>Affected Components:</strong> Data Pipeline</p>
    <p><small>Oct <var data-var='date'> 29</var>, <var data-var='time'>13:00:00</var> GMT+0</small><br /><strong>Monitoring</strong> -
  The service has now recovered; we are monitoring for additional errors.

We experienced a scaling issue with our data processing service, which delayed the handling of run results and run starts. Some delays may have resulted in failures to record runs.</p>
<p><small>Oct <var data-var='date'> 29</var>, <var data-var='time'>17:47:55</var> GMT+0</small><br /><strong>Resolved</strong> -
  ## This incident has been resolved

All systems are operational.

## Impact Assessment

* customers&#039; run results were significantly delayed for ~5 hours
* the delays caused knock-on issues in services that consume the results, such as webhooks and integrations

## Root cause analysis

At ~8:20pm PST (3:20am UTC) an internal cleanup task ran that removed old Docker images used to deploy our data processing service. An oversight in how this task was configured relative to our deployment process resulted in the currently deployed service&#039;s Docker image being removed.

Between ~4:00am PST (11:00am UTC) and ~9:00am PST (4:00pm UTC) the system was unable to take any scaling action while it tried to deploy the deleted image. A small number of instances were still running and processing tasks, but not enough to handle the load.

Once we re-deployed a newer image, we quickly recovered to normal operating status.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/cm0cvwjfp000bx9j41jr2ixkn</id>
  <published>2024-08-27T20:33:08.926+00:00</published>
  <updated>2024-08-27T20:33:08.926+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cm0cvwjfp000bx9j41jr2ixkn"/>
  <title>Playwright Orchestration intermittent failures </title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 4 hours and 10 minutes</p>
    <p><strong>Affected Components:</strong> Data Pipeline, Playwright Integration, Data Ingestion</p>
    <p><small>Aug <var data-var='date'> 27</var>, <var data-var='time'>20:33:08</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are observing an increased error rate for Playwright orchestration requests. We are currently investigating this incident.</p>
<p><small>Aug <var data-var='date'> 27</var>, <var data-var='time'>22:19:31</var> GMT+0</small><br /><strong>Identified</strong> -
  We have identified the root cause and released a hot patch that prevents erroneous HTTP responses. Orchestration isn&#039;t currently distributing tests in the optimal order; we are working on restoring the service to full capacity.</p>
<p><small>Aug <var data-var='date'> 28</var>, <var data-var='time'>00:42:54</var> GMT+0</small><br /><strong>Resolved</strong> -
  ## This incident has been resolved

All systems are operational.

## Impact Assessment

* a few customers weren&#039;t able to create orchestrated runs for ~1.5 hours
* the orchestration didn&#039;t produce optimal results for ~4 hours

## Root cause analysis

At ~12:15pm PST (7:15pm UTC) our database cluster performed a failover from its primary node to one of its secondary nodes as part of applying a maintenance patch.

```log
{&quot;t&quot;:{&quot;$date&quot;:&quot;2024-08-27T19:12:21.629+00:00&quot;},&quot;s&quot;:&quot;I&quot;,  &quot;c&quot;:&quot;ACCESS&quot;,   &quot;id&quot;:20250,   &quot;ctx&quot;:&quot;conn126158&quot;,&quot;msg&quot;:&quot;Authentication succeeded&quot;,&quot;attr&quot;:{&quot;mechanism&quot;:&quot;SCRAM-SHA-1&quot;,&quot;speculative&quot;:true,&quot;principalName&quot;:&quot;mms-automation&quot;,&quot;authenticationDatabase&quot;:&quot;admin&quot;,&quot;remote&quot;:&quot;192.168.253.167:50284&quot;,&quot;extraInfo&quot;:{}}}

```

```log
{&quot;t&quot;:{&quot;$date&quot;:&quot;2024-08-27T19:15:01.090+00:00&quot;},&quot;s&quot;:&quot;I&quot;,  &quot;c&quot;:&quot;COMMAND&quot;,  &quot;id&quot;:21579,   &quot;ctx&quot;:&quot;conn126158&quot;,&quot;msg&quot;:&quot;Attempting to step down in response to replSetStepDown command&quot;}

```

The patched server caused one of the queries to return a malformed response, which resulted in a cascading series of failures. As a result, some customers were not able to create orchestrated runs.

After releasing a hot patch that restored system stability, we released a permanent fix that restored full service capacity.

## Update 2024-08-28 12:25AM PST

The root cause was further refined to an issue in a MongoDB update. From MongoDB support:

_The cluster underwent a MongoDB version upgrade from v6.0.16 to v6.0.17. The plan for the upgrade completed at_ `08/27/2024, at 07:13:33 PM UTC`_, which corresponds with the onset of the query issue._

_MongoDB&#039;s internal team identified aggregation issues introduced in version v6.0.17 and has scheduled a downgrade of the cluster to v6.0.16._

_MongoDB is progressively reverting the 6.0.17 upgrades back to 6.0.16, but I do not have a timeline for when your clusters will be done._

_We sincerely apologize for the inconvenience. Please let me know if you have any further questions._</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/clz74zv6u45567gzn1wqrzvl8f</id>
  <published>2024-07-29T15:21:22.134+00:00</published>
  <updated>2024-07-29T15:21:22.134+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/clz74zv6u45567gzn1wqrzvl8f"/>
  <title>Run timeouts and unresponsive results</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 2 hours and 4 minutes</p>
    <p><strong>Affected Components:</strong> API - HTTP REST API, API - Dashboard Browsing, Data Pipeline, Data Ingestion</p>
    <p><small>Jul <var data-var='date'> 29</var>, <var data-var='time'>15:21:22</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.

We&#039;re seeing a spike in processing load that has resulted in a slowdown in results processing. This is having cascading effects on how results and run progress are shown in our system, including run timeouts.</p>
<p><small>Jul <var data-var='date'> 29</var>, <var data-var='time'>16:03:28</var> GMT+0</small><br /><strong>Monitoring</strong> -
  We&#039;ve increased our task processing capacity and have restored performance.

We continue to investigate and monitor.</p>
<p><small>Jul <var data-var='date'> 29</var>, <var data-var='time'>17:24:53</var> GMT+0</small><br /><strong>Resolved</strong> -
  The incident is now resolved. We&#039;ve identified the poorly performing query that caused it.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/clwrzhkqc0523bon84gdzmrav</id>
  <published>2024-05-29T15:31:13.343+00:00</published>
  <updated>2024-05-29T15:31:13.343+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/clwrzhkqc0523bon84gdzmrav"/>
  <title>In-app support is currently experiencing issues.</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 52 minutes</p>
    <p><strong>Affected Components:</strong> API - Dashboard Browsing</p>
    <p><small>May <var data-var='date'> 29</var>, <var data-var='time'>15:31:13</var> GMT+0</small><br /><strong>Identified</strong> -
  Our in-app support chat is currently experiencing issues. If you need support, please reach out via our support email.

Our chat provider&#039;s incident status can be followed here: &lt;https://www.intercomstatus.com/incidents/lxsqvttdjj76&gt;</p>
<p><small>May <var data-var='date'> 29</var>, <var data-var='time'>16:35:44</var> GMT+0</small><br /><strong>Monitoring</strong> -
  Chat support should now be working, but may be a little slow. If it has trouble loading, please refresh the page and try again.

Our email support is always available if you are running into trouble.</p>
<p><small>May <var data-var='date'> 29</var>, <var data-var='time'>17:23:04</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/cluvkg8cp51353b7nakvrqzgew</id>
  <published>2024-04-11T17:30:00.000+00:00</published>
  <updated>2024-04-11T19:15:19.914+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cluvkg8cp51353b7nakvrqzgew"/>
  <title>Dashboard run timeouts and slowness on seeing results in the dashboard</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 45 minutes</p>
    <p><strong>Affected Components:</strong> Data Pipeline</p>
    <p><small>Apr <var data-var='date'> 11</var>, <var data-var='time'>19:15:19</var> GMT+0</small><br /><strong>Resolved</strong> -
  The system has recovered.</p>
<p><small>Apr <var data-var='date'> 11</var>, <var data-var='time'>17:30:00</var> GMT+0</small><br /><strong>Monitoring</strong> -
  We have increased our processing capacity, and are monitoring as the system catches up with the load.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/clsf87d7w139082bjn24jp9b18d</id>
  <published>2024-02-09T22:35:24.063+00:00</published>
  <updated>2024-02-09T22:35:24.063+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/clsf87d7w139082bjn24jp9b18d"/>
  <title>Missing test results for some executions</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 3 minutes</p>
    <p><strong>Affected Components:</strong> Data Pipeline</p>
    <p><small>Feb <var data-var='date'> 9</var>, <var data-var='time'>22:35:24</var> GMT+0</small><br /><strong>Investigating</strong> -
  As part of planned maintenance with no expected downtime, we&#039;ve detected that a small percentage of test results are missing. We are currently investigating this incident.</p>
<p><small>Feb <var data-var='date'> 9</var>, <var data-var='time'>23:38:23</var> GMT+0</small><br /><strong>Resolved</strong> -
  ✅ The root cause has been identified and classified as a false negative.

Spec files with no tests were mistakenly identified as missing results. There was no real impact on customer data.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/cllbi2pgx10251bencuwzwxatl</id>
  <published>2023-08-14T23:22:42.169+00:00</published>
  <updated>2023-08-15T17:39:35.093+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cllbi2pgx10251bencuwzwxatl"/>
  <title>Intermittent parallelization errors </title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 23 hours and 53 minutes</p>
    <p><strong>Affected Components:</strong> API - HTTP REST API, API - Dashboard Browsing, Data Ingestion</p>
    <p><small>Aug <var data-var='date'> 15</var>, <var data-var='time'>17:39:35</var> GMT+0</small><br /><strong>Monitoring</strong> -
  Scaling up the Redis cluster is complete. We are currently monitoring system performance.</p>
<p><small>Aug <var data-var='date'> 14</var>, <var data-var='time'>23:22:42</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are experiencing intermittent parallelization-related issues; arbitrary runs can fail to report their results.</p>
<p><small>Aug <var data-var='date'> 14</var>, <var data-var='time'>23:36:33</var> GMT+0</small><br /><strong>Identified</strong> -
  The root cause has been identified - the backing Redis cluster exceeded the allowed network bandwidth limits, causing some requests to drop. Performing cluster reconfiguration.</p>
<p><small>Aug <var data-var='date'> 15</var>, <var data-var='time'>00:14:13</var> GMT+0</small><br /><strong>Monitoring</strong> -
  The Redis reconfiguration is complete and the number of errors dropped to zero. We are working on follow-up tasks to ensure better scalability while continuing to monitor for errors.</p>
<p><small>Aug <var data-var='date'> 15</var>, <var data-var='time'>06:36:09</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>
<p><small>Aug <var data-var='date'> 15</var>, <var data-var='time'>16:38:51</var> GMT+0</small><br /><strong>Identified</strong> -
  Reverting the status back to &quot;Identified.&quot;

Scaling up the cluster to accommodate the increased demand. The backing Redis cluster exceeded the allowed network bandwidth limits, causing some requests to drop.</p>
<p><small>Aug <var data-var='date'> 15</var>, <var data-var='time'>23:15:27</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/clkbmc6wf227564choc1jjrtvvj</id>
  <published>2023-07-20T20:42:20.567+00:00</published>
  <updated>2023-07-20T23:43:33.560+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/clkbmc6wf227564choc1jjrtvvj"/>
  <title>Analytics and metrics partial outage</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 9 hours and 23 minutes</p>
    <p><strong>Affected Components:</strong> API - Dashboard Browsing, API - HTTP REST API</p>
    <p><small>Jul <var data-var='date'> 20</var>, <var data-var='time'>23:43:33</var> GMT+0</small><br /><strong>Monitoring</strong> -
  The cluster rebalancing is complete. We have re-enabled all the ES queries and are monitoring performance before resolving.</p>
<p><small>Jul <var data-var='date'> 21</var>, <var data-var='time'>06:05:37</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>
<p><small>Jul <var data-var='date'> 20</var>, <var data-var='time'>20:42:20</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are dealing with a partial outage of our analytics systems. Insights and performance metrics are temporarily unavailable.</p>
<p><small>Jul <var data-var='date'> 20</var>, <var data-var='time'>22:55:30</var> GMT+0</small><br /><strong>Identified</strong> -
  The root cause was identified as a failed allocator for the Elasticsearch cluster.

A failure in one of the Elasticsearch nodes caused increased CPU load for the whole cluster. We are relocating the data to a different node and rebalancing the shards across the new nodes.

We turned off some search queries to temporarily reduce the cluster load and shorten the recovery time.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/cliuo3jmi60262bemzrsba5cvj</id>
  <published>2023-06-13T18:19:00.000+00:00</published>
  <updated>2023-06-13T18:19:00.000+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cliuo3jmi60262bemzrsba5cvj"/>
  <title>50x responses from cloud services</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 11 hours and 29 minutes</p>
    <p><strong>Affected Components:</strong> 3rd Party Integrations, Data Ingestion, API</p>
    <p><small>Jun <var data-var='date'> 13</var>, <var data-var='time'>18:19:00</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Jun <var data-var='date'> 13</var>, <var data-var='time'>19:25:07</var> GMT+0</small><br /><strong>Identified</strong> -
  The issue is caused by the widespread [AWS outage](https://health.aws.amazon.com/health/home#/account/dashboard/open-issues?eventID=arn:aws:health:us-east-1::event/MULTIPLE_SERVICES/AWS_MULTIPLE_SERVICES_OPERATIONAL_ISSUE/AWS_MULTIPLE_SERVICES_OPERATIONAL_ISSUE_798F8_DDDD18AFADD&amp;eventTab=details).
Switching to an alternative runtime.</p>
<p><small>Jun <var data-var='date'> 13</var>, <var data-var='time'>20:06:54</var> GMT+0</small><br /><strong>Identified</strong> -
  We are attempting to switch the workload from AWS Lambda to AWS ECS; however, the commands are failing with a timeout.
We are looking for alternative ways of changing the infrastructure setup.</p>
<p><small>Jun <var data-var='date'> 13</var>, <var data-var='time'>21:17:39</var> GMT+0</small><br /><strong>Monitoring</strong> -
  We implemented a fix and are currently monitoring the result.</p>
<p><small>Jun <var data-var='date'> 14</var>, <var data-var='time'>05:48:13</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/clfu4gso389254893omxfc107a0</id>
  <published>2023-03-29T20:11:08.243+00:00</published>
  <updated>2023-03-29T20:11:08.243+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/clfu4gso389254893omxfc107a0"/>
  <title>Runs are not completing due to timeouts</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 18 minutes</p>
    <p><strong>Affected Components:</strong> Data Ingestion, Data Pipeline, Cypress Integration</p>
    <p><small>Mar <var data-var='date'> 29</var>, <var data-var='time'>20:11:08</var> GMT+0</small><br /><strong>Identified</strong> -
  The root cause has been identified; we are rolling back the faulty deployment.</p>
<p><small>Mar <var data-var='date'> 29</var>, <var data-var='time'>20:20:55</var> GMT+0</small><br /><strong>Monitoring</strong> -
  We implemented a fix and are currently monitoring the result.</p>
<p><small>Mar <var data-var='date'> 29</var>, <var data-var='time'>20:29:29</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved. The root cause was identified as an issue with a downstream npm dependency.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/clf05506d675085haojmti1po47</id>
  <published>2023-03-08T20:36:52.414+00:00</published>
  <updated>2023-03-08T20:36:52.414+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/clf05506d675085haojmti1po47"/>
  <title>Cypress Runners version 12.6.0+ exit with &quot;Unsupported Recording Service&quot;</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 6 days and 20 minutes</p>
    <p><strong>Affected Components:</strong> Cypress Integration</p>
    <p><small>Mar <var data-var='date'> 8</var>, <var data-var='time'>20:36:52</var> GMT+0</small><br /><strong>Identified</strong> -
  Cypress 12.6.0+ introduced a breaking change to the Currents integration - you may see an error message `Unsupported recording service`.

We are working on restoring the integration. Meanwhile, please downgrade the Cypress runner to version 12.5.0. We apologize for the inconvenience; a separate communication will follow.</p>
<p><small>Mar <var data-var='date'> 10</var>, <var data-var='time'>11:01:11</var> GMT+0</small><br /><strong>Resolved</strong> -
  Customers experiencing &quot;Unsupported Recording Service&quot;: please consider migrating to the https://github.com/currents-dev/cypress-cloud package.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/clesvrxct94755honb1n3r4une</id>
  <published>2023-03-03T18:40:22.372+00:00</published>
  <updated>2023-03-03T18:40:22.372+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/clesvrxct94755honb1n3r4une"/>
  <title>API - Dashboard degraded performance</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 2 days, 12 hours and 27 minutes</p>
    <p><strong>Affected Components:</strong> API</p>
    <p><small>Mar <var data-var='date'> 3</var>, <var data-var='time'>18:40:22</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are seeing slow response times when loading the dashboard and receiving API responses. We are currently investigating this incident.</p>
<p><small>Mar <var data-var='date'> 4</var>, <var data-var='time'>07:49:58</var> GMT+0</small><br /><strong>Monitoring</strong> -
  We implemented a fix and are currently monitoring the result.</p>
<p><small>Mar <var data-var='date'> 6</var>, <var data-var='time'>07:07:13</var> GMT+0</small><br /><strong>Resolved</strong> -
  This incident has been resolved.</p>

        ]]>
  </content>
</entry>

<entry>
  <id>tag:currents.instatus.com,2005:Incident/cl3ix78jn0560t9obbmk88arp</id>
  <published>2022-05-23T16:05:48.781+00:00</published>
  <updated>2022-05-23T16:05:48.781+00:00</updated>
  <link rel="alternate" type="text/html" href="https://currents.instatus.com/incident/cl3ix78jn0560t9obbmk88arp"/>
  <title>API - Dashboard degraded performance</title>

  <content type="html">
  <![CDATA[
    <p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 2 hours and 32 minutes</p>
    <p><strong>Affected Components:</strong> Data Pipeline, API</p>
    <p><small>May <var data-var='date'> 23</var>, <var data-var='time'>16:05:48</var> GMT+0</small><br /><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>May <var data-var='date'> 23</var>, <var data-var='time'>18:38:12</var> GMT+0</small><br /><strong>Resolved</strong> -
  We just resolved the issue!</p>
<p><small>May <var data-var='date'> 23</var>, <var data-var='time'>16:05:48</var> GMT+0</small><br /><strong>Identified</strong> -
  We are continuing to work on a fix for this incident.

- Requests to a downstream Elasticsearch dependency have been timing out.
- We have initiated an increase of Elasticsearch cluster capacity.</p>
<p><small>May <var data-var='date'> 23</var>, <var data-var='time'>17:42:47</var> GMT+0</small><br /><strong>Monitoring</strong> -
  - We implemented a fix and are currently monitoring the result.
- Due to the outage, some items take longer than expected to appear in the dashboard.
- We are running a process to validate data integrity and backfill the missing data in the Elasticsearch cluster.</p>

        ]]>
  </content>
</entry>

</feed>