Crowdmap: We're making changes

As a policy, we try to keep the community as informed as possible when major service disruptions occur. Last Friday, March 30, 2012, Ushahidi and Crowdmap experienced an extended period of downtime caused by a combination of external factors at our hosting provider’s data center.

Summary

Beginning at approximately 13:30 UTC, a distributed denial of service (DDoS) attack was initiated against the pool of servers our web services are hosted on. Neither Ushahidi nor Crowdmap was the target of this attack, but all of the sites hosted in this pool were affected nonetheless. Although a DDoS is always a serious concern, these attacks can normally be managed using a variety of techniques. In this case, however, a bug in our hosting provider’s networking infrastructure made mitigating the attack extremely challenging.

At approximately 14:45 UTC, our provider powered down our server pool completely to patch the vulnerability in hopes of restoring services. Over the next few hours, our hosting provider encountered two technical obstacles that delayed the restoration of service. First, at 16:15 UTC, a problem with the newly patched networking configuration was encountered and needed to be fixed. Second, at 17:45 UTC, as our provider rushed to restore services, our pool effectively crashed under the strain of all the servers booting up at once. Additional storage had to be added to support this scenario.

At approximately 18:00 UTC, servers in our pool began booting successfully. Due to the sheer number of servers in our cluster, however, Ushahidi would not see its servers fully operational until 22:00 UTC. The total downtime of this outage was approximately 9 hours.

Analysis

This was an unfortunate outage that really shouldn’t have happened. DDoS attacks are always bad news, of course, but they are manageable. The bulk of the downtime was due to our provider rushing to fix the bug in their network infrastructure, and the complications that arose from that. It was a regrettable turn of events, and we share your frustration.

Ushahidi is in the middle of transitioning our servers to an advanced new infrastructure with the leading hosting provider in the industry. These new servers are fully managed, monitored at all hours of the day, and located on completely different hardware from our existing infrastructure. We are continuing the transition to this new infrastructure this week, and our goal is to have all our critical services moved in a few weeks’ time. We will of course keep you apprised of our progress as it occurs.

Next steps: Maintenance this weekend

We are doing some server work on Friday, April 6, starting at 15:00 UTC. We don’t expect much, if any, downtime, but we wanted to give you a heads up. The changes should mean better reliability of our service, with less downtime moving forward. (For more Crowdmap maintenance details.) Thank you for your patience and understanding! (Written by Evan Sims, Senior Developer)