Main app is unavailable
Incident Report for Strigo
Resolved
Starting on August 9th at 3:23 PM UTC, our main app was down for 19 minutes.

We deployed a version of our app that adds infrastructure for a new feature. It contained a bug that drove all of our main app's processes to 100% CPU.

We reverted the change, verified that everything worked, and deployed a version that fixed the issue.

By 3:42 PM UTC, the system was back to normal.

In retrospect:
* Downtime could have been shorter (around 3 minutes instead of 19), but a technicality in how we revert changes meant that our first attempt to deploy a fix didn't actually deploy anything. We will improve that process.
* A more robust testing and rollout framework could have helped us find this problem before it reached the production environment. That work is already in progress.
Posted Aug 09, 2021 - 15:23 IDT