Strigo's labs are currently down

Incident Report for Strigo

Postmortem

What happened?

On November 11th at 12:25 pm UTC, customer data in the production database was deleted. The deletion was limited to data about organization accounts and class templates.

At 4:52 pm UTC, the data was restored and the system became fully functional again.

How did this affect you?

Because all organization and class data was missing, the system was practically unavailable for existing customers and their activity. New organization accounts and class templates could still be created.

How was it addressed?

We restored a backup of the production database in a different cluster.

The backup was taken on November 11th at 6:33 am UTC, so the gap in data between the backup and the time of the incident was about 6 hours.
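As a rough check on that figure, the gap can be computed from the two timestamps stated in this report. The snippet below is a minimal Python sketch and not part of our tooling; the variable names are illustrative, and the year is assumed from the posting date of this report.

```python
from datetime import datetime, timezone

# Timestamps as stated in this report (UTC); the year is assumed from the posting date.
backup_taken = datetime(2020, 11, 11, 6, 33, tzinfo=timezone.utc)
data_deleted = datetime(2020, 11, 11, 12, 25, tzinfo=timezone.utc)

# Window of changes that the backup could not contain.
gap = data_deleted - backup_taken
hours, remainder = divmod(int(gap.total_seconds()), 3600)
minutes = remainder // 60

print(f"Data gap: {hours} h {minutes} min")
```

This gives 5 hours and 52 minutes, which is the "about 6 hours" quoted above.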

The production database was left as-is, and only the missing data (organization accounts and class templates) was restored from the backup.

Once the missing data had been restored, existing customers could resume their activity.

Does it have further implications?

Changes to organizations (invites, role changes) or to class templates made on November 11th between 6:33 am and 12:25 pm UTC were lost. Customers will have to apply such changes again manually.

Other than that, there are no implications for live training activity or for billing.

How did it happen?

The data was deleted due to human error while troubleshooting a production problem in the database.

How did Strigo find out about it?

We got an alert from our monitoring system about a failed API health check.

This revealed that data about organization accounts had been deleted.

What is Strigo going to do about it?

We are still investigating exactly how it happened.

Our actions will generally revolve around:

Making sure that access to the database is more restricted and controlled.
Enforcing least-privilege policies more strictly.
Improving our Recovery Point Objective (RPO) and Recovery Time Objective (RTO).

Last word

We apologize for the effect this incident had on you.

We strive to improve our service and treat the reliability and security of your data and activity as our highest priority.

We are very critical of ourselves and will work to keep such incidents to a minimum.

Posted Nov 17, 2020 - 16:40 IST

Resolved

This incident has been resolved.

Posted Nov 11, 2020 - 16:52 IST

Update

We are continuing to work on a fix for this issue.

Posted Nov 11, 2020 - 15:56 IST

Identified

The issue has been identified and a fix is being implemented.

Posted Nov 11, 2020 - 15:54 IST

Investigating

We are currently investigating this issue.

Posted Nov 11, 2020 - 14:25 IST

This incident affected: Strigo service.