The system is working normally again. During the incident, you may have been unable to create runs, and run data from the incident window may be missing from the dashboard, leading to timed-out runs and missing results.
New runs should be working as expected as of Sat 11:30pm UTC.
We are still closely monitoring for further errors, and we will investigate the root cause in more depth on Monday.
Several steps were taken during the incident: increasing our database capacity, canceling long-running tasks, and rolling back some of our most recently deployed services.
It's not yet clear whether the rollback or the canceling of long-running tasks resolved the incident. We will look further into the root cause during regular business hours.