Incident Summary
We want to provide you with more details about this week’s Crowdmap outage.
On Tuesday, March 6, 2012 at 08:00 UTC we issued an update to the platform. This update was meant to take orphaned reports (reports that are in the system but cannot be viewed because they have no category) and allocate them to a new category created specifically for orphaned reports. Unfortunately, this update had a bug that was not caught in the testing done before it was applied. On some deployments, orphaned reports were mixed with reports in another category, and subsequently all reports in that category were marked as unapproved (visible only to administrators).
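For readers curious about what the update was intended to do, here is a minimal sketch of that kind of migration. The schema, table names, and function below are hypothetical illustrations, not Crowdmap’s actual code; the incident arose when logic of this kind ended up touching reports that already had a category.

```python
import sqlite3

# Hypothetical schema for illustration only; not Crowdmap's actual tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE incident (id INTEGER PRIMARY KEY, title TEXT, approved INTEGER DEFAULT 1);
CREATE TABLE incident_category (incident_id INTEGER, category_id INTEGER);
""")

def adopt_orphaned_reports(conn):
    """Attach reports that have no category at all to a dedicated
    'Orphaned Reports' category, leaving every other report (and its
    approval flag) untouched."""
    cur = conn.cursor()
    cur.execute("INSERT INTO category (title) VALUES ('Orphaned Reports')")
    orphan_category_id = cur.lastrowid
    # Select only reports with no category link whatsoever.
    cur.execute("""
        SELECT i.id FROM incident i
        LEFT JOIN incident_category ic ON ic.incident_id = i.id
        WHERE ic.incident_id IS NULL
    """)
    orphan_ids = [row[0] for row in cur.fetchall()]
    cur.executemany(
        "INSERT INTO incident_category (incident_id, category_id) VALUES (?, ?)",
        [(i, orphan_category_id) for i in orphan_ids],
    )
    conn.commit()
    return orphan_ids

# Toy data: report 1 has a category, report 2 is orphaned.
conn.executescript("""
INSERT INTO category (id, title) VALUES (1, 'News');
INSERT INTO incident (id, title) VALUES (1, 'Categorised report'), (2, 'Orphaned report');
INSERT INTO incident_category (incident_id, category_id) VALUES (1, 1);
""")
print(adopt_orphaned_reports(conn))  # -> [2]
```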
After troubleshooting the issue, we determined that we needed to restore the database to our last known good configuration. Unfortunately, the regular database snapshot was on a 24-hour schedule, and the last available snapshot was taken almost 24 hours before the change. After weighing our options, we determined that the best path was to restore from the database backup rather than leave Crowdmaps with wrongly allocated reports for 24 hours. Without the restore, reports would have remained miscategorized and hidden from public view, and we had no way to determine which reports were public and which were private. By restoring from the backup, we eliminated the possibility of exposing information on deployments that weren’t meant to be public.
The process would take two steps: take a database snapshot of the current state (with the orphan-report issue), then restore from the older backup containing the most accurate data. All Crowdmap deployments would be restored to active data as of Monday, March 5 at 02:30 UTC. This ensured that no active data would be corrupted. We anticipated that the restore might leave some data missing, so with the snapshot of the most current data in hand, we planned to run a comparison of the deployment data sets once the restore was complete.
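As a rough illustration of that comparison step, the sketch below lists reports that exist in the pre-restore snapshot but not in the restored data set. The record shapes, names, and cutoff handling are assumptions for the example, not the actual tooling we used.

```python
from datetime import datetime, timezone

# Restore point taken from the plan above; record shapes below are illustrative.
RESTORE_POINT = datetime(2012, 3, 5, 2, 30, tzinfo=timezone.utc)

def reports_missing_after_restore(snapshot_reports, restored_reports):
    """Compare the pre-restore snapshot (taken with the orphan-report issue
    present) against the restored data set and return reports that exist in
    the snapshot but not in the restore, i.e. data created after the restore
    point."""
    restored_ids = {r["id"] for r in restored_reports}
    return [
        r for r in snapshot_reports
        if r["id"] not in restored_ids and r["created_at"] > RESTORE_POINT
    ]

# Example usage with toy records:
snapshot = [
    {"id": 1, "created_at": datetime(2012, 3, 4, 12, 0, tzinfo=timezone.utc)},
    {"id": 2, "created_at": datetime(2012, 3, 5, 9, 0, tzinfo=timezone.utc)},
]
restored = [{"id": 1, "created_at": datetime(2012, 3, 4, 12, 0, tzinfo=timezone.utc)}]
print(reports_missing_after_restore(snapshot, restored))  # -> report 2 only
```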
The database restoration process took much longer than we anticipated, and we escalated with our hosting provider. On the morning of March 7, 2012 at 10:00 UTC, our hosting provider offered us two alternative courses of action for the restore. We selected the Bare Metal Restore. Or at least, we thought we had. The database restoration took most of March 7th UTC. Once we ascertained that the restore was going more slowly than expected, we escalated again. On the morning of March 8th UTC, we were advised that the restore had actually followed the previous restore path. This was corrected by our provider, and within four hours we were back online and began testing.
Crowdmap services were offline from the morning of March 6 to the morning of March 8, 2012, a total of 44 hours.
Lessons Learned and Changes
Crowdmap customers deserve uptime. We were already working on a redundancy and scalability project for the Crowdmap service, and this downtime has made that project the Crowdmap team’s sole focus. Everything is changing, from how we deploy new code and database changes to the architecture of the databases themselves. The timeline has been set for April, and we will announce a maintenance window for the change closer to the date. This change will include an improved backup plan to reduce the possibility of major outages like this one occurring again.
We’re working on a more rigorous quality assurance program for changes, and we will add simulation plans to test our responses to emergencies. Lastly, we will be creating a deployer alerting email list that you can opt in to for notifications about outages and important updates.
Some of these changes will take time, but we want to let you know that we are committed to meeting the lessons we’ve learned with action. Thank you to each of our deployers and community members for your support and patience.