Currents - Notice history

100% - uptime

API - HTTP REST API - Operational

100% - uptime
Jun 2024 · 100%, Jul 2024 · 100%, Aug 2024 · 100%

API - Dashboard Browsing - Operational

100% - uptime
Jun 2024 · 100%, Jul 2024 · 100%, Aug 2024 · 100%

API - Reporting and Orchestration - Operational

100% - uptime
Jun 2024 · 100%, Jul 2024 · 100%, Aug 2024 · 100%

Data Pipeline - Operational

100% - uptime
Jun 2024 · 100%, Jul 2024 · 100%, Aug 2024 · 100%

Scheduler - Operational

100% - uptime
Jun 2024 · 100%, Jul 2024 · 100%, Aug 2024 · 100%

3rd Party Integrations - Operational

100% - uptime
Jun 2024 · 100%, Jul 2024 · 100%, Aug 2024 · 100%

Cypress Integration - Operational

100% - uptime
Jun 2024 · 100%, Jul 2024 · 100%, Aug 2024 · 100%

Playwright Integration - Operational

100% - uptime
Jun 2024 · 100%, Jul 2024 · 100%, Aug 2024 · 100%

Notice history

Aug 2024

Playwright Orchestration intermittent failures
  • Resolved

    This incident has been resolved.

    All systems are operational.

    Impact Assessment

    • A few customers were unable to create orchestrated runs for ~1.5 hours

    • Orchestration did not produce optimal results for ~4 hours

    Root cause analysis

    At ~12:15pm PDT (7:15pm UTC) our database cluster performed a failover from the primary node to one of its secondary nodes as part of applying a maintenance patch.

    {"t":{"$date":"2024-08-27T19:12:21.629+00:00"},"s":"I",  "c":"ACCESS",   "id":20250,   "ctx":"conn126158","msg":"Authentication succeeded","attr":{"mechanism":"SCRAM-SHA-1","speculative":true,"principalName":"mms-automation","authenticationDatabase":"admin","remote":"192.168.253.167:50284","extraInfo":{}}}
    
    {"t":{"$date":"2024-08-27T19:15:01.090+00:00"},"s":"I",  "c":"COMMAND",  "id":21579,   "ctx":"conn126158","msg":"Attempting to step down in response to replSetStepDown command"}
    

    The patched server caused one of the queries to return a malformed response, which resulted in a cascading series of failures. As a result, some customers were not able to create orchestrated runs.

    After releasing a hot patch that restored system stability, we deployed a permanent fix that restored full service capacity.
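
    For illustration only (the actual hot patch is not published in this notice), a guard of roughly this shape is one way to keep a malformed query response from cascading: validate the documents returned by the database and fail a single request instead of passing bad data downstream. The RunGroup type and its field names below are hypothetical:

    // Hypothetical shape of an orchestration document; illustrative only.
    interface RunGroup {
      runId: string;
      specs: string[];
    }

    // Narrow an unknown document to RunGroup, rejecting anything malformed.
    function isValidRunGroup(doc: unknown): doc is RunGroup {
      if (typeof doc !== "object" || doc === null) return false;
      const d = doc as Record<string, unknown>;
      return (
        typeof d.runId === "string" &&
        Array.isArray(d.specs) &&
        d.specs.every((s) => typeof s === "string")
      );
    }

    // Reject malformed results at the boundary so one bad response fails a
    // single request rather than cascading into downstream failures.
    function assertValidResults(results: unknown[]): RunGroup[] {
      const valid = results.filter(isValidRunGroup);
      if (valid.length !== results.length) {
        throw new Error("Malformed aggregation response rejected");
      }
      return valid;
    }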

    Update 2024-08-28 12:25AM PST

    The root cause was further narrowed down to an issue introduced by a MongoDB update.

    The cluster underwent a MongoDB version upgrade from v6.0.16 to v6.0.17. The upgrade plan completed on 08/27/2024 at 07:13:33 PM UTC, which corresponds with the onset of the query issue.

    MongoDB's internal team identified aggregation issues introduced in v6.0.17 and has scheduled a downgrade of the cluster to v6.0.16.

    MongoDB is progressively reverting the v6.0.17 upgrades back to v6.0.16, but has not provided a timeline for when individual clusters will be completed.
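
    A minimal sketch (assuming the MongoDB Node.js driver and a hypothetical connection string) of how one might verify whether a given cluster is still running the affected v6.0.17 release, using the buildInfo admin command:

    import { MongoClient } from "mongodb";

    // Hypothetical URI for illustration only; substitute the real cluster address.
    const uri = process.env.MONGODB_URI ?? "mongodb+srv://cluster.example.mongodb.net";

    async function checkServerVersion(): Promise<void> {
      const client = new MongoClient(uri);
      try {
        await client.connect();
        // buildInfo reports the exact server version string, e.g. "6.0.17".
        const info = await client.db("admin").command({ buildInfo: 1 });
        console.log("MongoDB server version:", info.version);
        if (info.version === "6.0.17") {
          console.warn("Cluster is still on the affected 6.0.17 release.");
        }
      } finally {
        await client.close();
      }
    }

    checkServerVersion().catch(console.error);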

    We sincerely apologize for the inconvenience. Please let us know if you have any further questions.

  • Identified

    We have identified the root cause and released a hot patch that prevents erroneous HTTP responses. Orchestration is not currently distributing tests in the optimal order; we are working on restoring the service to full capacity.

  • Investigating

    We are observing an increased error rate for Playwright orchestration requests. We are currently investigating this incident.

Jun 2024

No notices reported this month

Jun 2024 to Aug 2024
