Yesterday (the 21st of February, at 12:35 AM), we experienced downtime in our Classroom and Lounge applications that lasted 17 minutes.
We noticed that our system was stable overall, but that the containers of our main classroom app were showing high CPU usage across the board, which prevented our customers from accessing the app.
After a rather deep investigation, we concluded (without hard evidence) that this was probably related to network issues between our clients and our datacenters, which (due to the reactive nature of our app) caused the clients to bombard our main app with requests. We didn’t see any issues with our database, load balancers, or scheduling systems.
We’ve taken steps to ensure that if our main app becomes unavailable, our scheduler will restart it automatically to clear the pending requests.
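For readers curious what such an automatic restart can look like, here is a minimal sketch in Python. It is not our actual scheduler configuration: the `Watchdog` class, the `check` and `restart` hooks, and the threshold of three consecutive failures are all hypothetical names and values chosen for illustration. The idea is simply that a run of failed health checks triggers a restart.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Watchdog:
    """Triggers a restart after a run of consecutive failed health checks."""

    check: Callable[[], bool]    # hypothetical probe: returns True when the app responds
    restart: Callable[[], None]  # hypothetical hook that asks the scheduler to restart the app
    failure_threshold: int = 3   # assumed value, not taken from the incident
    failures: int = 0            # consecutive failed probes seen so far

    def tick(self) -> bool:
        """Run one probe; return True if this probe triggered a restart."""
        if self.check():
            # A healthy response resets the failure streak.
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.restart()
            self.failures = 0
            return True
        return False
```

Requiring several consecutive failures, rather than restarting on the first one, is a common way to avoid restarting a healthy app over a single slow response.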
We apologize for the inconvenience.