High memory consumption making the app inaccessible

Incident Report for Strigo

Postmortem

Yesterday (the 21st of February, at 12:35 AM), we experienced downtime in our Classroom and Lounge applications that lasted 17 minutes.

We noticed that our system was stable overall, but the containers running our main classroom app were showing high CPU usage across the board, which prevented our customers from accessing the app.

After an in-depth investigation, we concluded (without hard evidence) that this was probably related to network issues between our clients and our datacenters, which caused the clients (due to the reactive nature of our app) to bombard our main app with requests. We didn’t see any issues with our database, load balancers, or scheduling systems.

We’ve taken steps to ensure that if our main app isn’t available, it will be restarted automatically by our scheduler to clear up those requests.
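
As a rough illustration of the kind of rule involved, the sketch below restarts the app after a number of consecutive failed health probes. It is not our actual scheduler configuration: the `RestartPolicy` class, its `record_probe` method, and the threshold of 3 are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class RestartPolicy:
    """Hypothetical policy: restart after N consecutive failed health probes."""

    failure_threshold: int = 3  # assumed value, not taken from the incident
    consecutive_failures: int = 0

    def record_probe(self, healthy: bool) -> bool:
        """Record one probe result; return True when a restart should be triggered."""
        if healthy:
            # A single healthy probe clears the failure streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Reset the counter so the restarted container gets a fresh window.
            self.consecutive_failures = 0
            return True
        return False


policy = RestartPolicy(failure_threshold=3)
probes = [True, False, False, True, False, False, False]
decisions = [policy.record_probe(p) for p in probes]
print(decisions)
```

Requiring several consecutive failures, rather than one, keeps a single slow response from triggering a restart while still clearing a backlog of requests when the app stays unavailable.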

We apologize for the inconvenience.

Posted Feb 22, 2023 - 11:16 IST

Resolved

We encountered a CPU and memory usage spike in our services.
This is suspected to be a result of a network connection issue that is now resolved.
After 4 hours of monitoring, we feel it is safe to mark the issue as resolved.

Posted Feb 21, 2023 - 15:37 IST

Update

We are continuing to monitor the fix we applied.

Posted Feb 21, 2023 - 12:00 IST

Update

Resources were reallocated and the app is up and running.
We're monitoring the app and investigating the root cause of the issue.

Posted Feb 21, 2023 - 10:54 IST

Update

We are continuing to investigate this issue.

Posted Feb 21, 2023 - 10:52 IST

Investigating

We've encountered an issue that is making the app inaccessible.
Our engineers are working on resolving the issue.

Posted Feb 21, 2023 - 10:35 IST

This incident affected: Website and Strigo service.