TL;DR — Crowdmap operates on a Rackspace Cloud-based infrastructure. We run nginx, PHP-FPM, MySQL and memcached. We have servers in Dallas, Chicago and Tokyo, and use Amazon's Route 53 (DNS service) to direct users based on their latency to the datacenters. New Crowdmap was designed from the start to be highly bandwidth efficient for mobile users and implements innovative techniques and bleeding edge technologies to achieve this.
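For the curious, the latency-routing piece of that boils down to publishing one DNS record per datacenter, each tagged with the AWS region nearest to it, and letting Route 53 answer each query with the lowest-latency record. Here's a rough sketch using boto3; the hostname, zone ID, addresses and region mapping below are placeholders, not our real configuration:

```python
# Sketch: latency-based DNS records in Route 53 via boto3.
# Each record points at one datacenter's load balancer and is tagged with the
# nearest AWS region; Route 53 answers queries with the lowest-latency record.
import boto3

route53 = boto3.client("route53")

DATACENTERS = [
    # (set identifier, nearest AWS region, load balancer IP) -- placeholder values
    ("dallas",  "us-east-1",      "203.0.113.10"),
    ("chicago", "us-east-2",      "203.0.113.20"),
    ("tokyo",   "ap-northeast-1", "203.0.113.30"),
]

changes = [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
        "Name": "example.crowdmap.com.",   # placeholder hostname
        "Type": "A",
        "SetIdentifier": identifier,
        "Region": region,                  # marks this as a latency record
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    },
} for identifier, region, ip in DATACENTERS]

route53.change_resource_record_sets(
    HostedZoneId="ZEXAMPLE123",            # placeholder hosted zone ID
    ChangeBatch={"Changes": changes},
)
```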
Howdy! I’m Evan Sims and I’m a senior developer at Ushahidi. Aside from getting to build awesome software, I also help oversee our server infrastructure. Today I'd like to talk to you about the technologies and effort that go into keeping Crowdmap online and running smoothly.
At Ushahidi we get excited about building tools that empower individuals, communities and organizations to do amazing things. We’re constantly trying new things and pushing ourselves into new territory, because we know that in the end our tools will turn out the better for it. Crowdmap has been a particularly challenging and rewarding adventure for us. It was our first attempt at a large-scale hosted service, and meant hosting tens of thousands of Ushahidi installations, managing software upgrades, securing servers, ensuring databases were backed up, monitoring performance, providing support and delivering on expectations of availability. It was and continues to be no small endeavor.
Scaling out and growing up
When I joined Ushahidi in December of 2011, my first task was to transition us to a robust hosting platform that would give us room to grow. Crowdmap at the time was with a provider that, while operated by great folks, just wasn’t able to keep up with the service’s exploding growth. I immediately began moving us to Rackspace’s Cloud platform, which we continue to run on today. They’re an amazing company with a stellar team, and I really enjoy working with them. After a few months of building, tuning and rigorous stress testing (I was a bit paranoid about my first task being a horrific, flaming failure), we flipped the switch and Classic’s new infrastructure went live. This new infrastructure comprised a robust load balancer and reverse proxy, two app servers, and a MySQL replication cluster with a total of 16GB of RAM. We also switched from Apache to nginx and PHP-FPM, which was a tremendous boost in and of itself. This was also the first time Crowdmap had true redundancy and fail-over support. There were growing pains, of course; configurations had to be tuned over the year that followed to fix edge cases and allow extremely large deployments to continue running as smoothly as possible, but overall the move went off splendidly.
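That replication cluster is also the piece we keep the closest eye on. As a rough illustration of the kind of health check involved (pymysql, the credentials and the lag threshold here are my own sketch, not our actual tooling):

```python
# Sketch: check that a MySQL replica is replicating and not lagging too far
# behind its master. Threshold and library choice are illustrative only.
import pymysql

MAX_LAG_SECONDS = 30  # arbitrary threshold for this sketch

def replica_is_healthy(host: str, user: str, password: str) -> bool:
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
    finally:
        conn.close()

    if not status:
        return False  # replication isn't configured on this host

    io_running = status.get("Slave_IO_Running") == "Yes"
    sql_running = status.get("Slave_SQL_Running") == "Yes"
    lag = status.get("Seconds_Behind_Master")

    return io_running and sql_running and lag is not None and lag <= MAX_LAG_SECONDS
```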
Going global with Version 3
A few months ago we launched a completely new version of Crowdmap, written from the ground up to be something new and different from the core platform. This was an interesting development journey (which I wrote about recently over on my Medium account), but when it came time to launch it meant expanding our server infrastructure.
I am the Keymaster, are you the Gatekeeper?
Every byte counts.
We built New Crowdmap from a blank slate, and in doing so re-evaluated everything. Classic was a monolithic platform that emphasized volume and management from a deployer’s perspective, and it fits that need perfectly. We wanted to take a slightly different approach for this reboot, however: something more approachable to everyday users, something that emphasized the importance of content ownership and removed all friction from the posting process. That meant mobile had to be a priority for us.

Mobile is a very tricky beast. Aside from the quirks of different browsers and the variation in capabilities across operating systems, there is the issue of network performance. Most people in the world aren’t breezing along on LTE; they have to make do with EDGE or 3G: extremely narrow pipes. Worse still is the latency overhead of mobile connections. Your phone is constantly hopping between cell towers, re-establishing its data connection and dealing with packet corruption.

We knew we wanted to deliver stunning high-resolution photos to users, but we also had to accommodate users on slower connections. That seems like a contradiction, but it really just means we have to be smart about what we send to users in these different conditions.

We do a lot of leg work with every photo uploaded to us. We generate four different sizes of the photo and deliver only the one appropriate for the device’s screen resolution. We run a stripping process that removes every unnecessary byte of metadata from the produced images (and, of course, sensitive EXIF metadata like embedded geolocation stamps). We convert PNGs into JPGs, since we don’t need the transparency and JPG offers better compression options. We convert to progressive JPG to avoid Mobile WebKit’s resolution limitations and to let users see a preview of media as it loads. We also produce alternative versions of each photo resolution using Google’s WebP format, which offers insanely good compression ratios – often 50% smaller than JPG – and deliver those when the device supports it. I built a library called uxImage (available here) that helps make it all work on the client side.
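If you're curious what that server-side pipeline looks like in practice, here's a rough sketch using Pillow; the rendition widths, quality settings and file names are illustrative rather than our production values:

```python
# Sketch of a server-side image pipeline: multiple sizes, metadata stripped,
# progressive JPEG output, plus a WebP alternative for browsers that support it.
# Sizes and quality values below are illustrative, not Crowdmap's settings.
from pathlib import Path
from PIL import Image, ImageOps

TARGET_WIDTHS = (320, 640, 1280, 2048)  # one rendition per rough device class

def process_upload(src: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    with Image.open(src) as original:
        # Apply the EXIF orientation flag, then re-encode from raw pixels so
        # EXIF (including GPS coordinates) and other metadata never reach
        # the derived files.
        upright = ImageOps.exif_transpose(original)
        clean = Image.new("RGB", upright.size)
        clean.paste(upright.convert("RGB"))

    for width in TARGET_WIDTHS:
        if width > clean.width:
            continue  # never upscale
        height = round(clean.height * width / clean.width)
        rendition = clean.resize((width, height), Image.LANCZOS)

        # Progressive JPEG: loads as a coarse preview that sharpens, and
        # sidesteps Mobile WebKit's limits on very large baseline JPEGs.
        rendition.save(out / f"photo_{width}.jpg", "JPEG",
                       quality=80, progressive=True, optimize=True)

        # WebP alternative, served only when the client advertises support.
        rendition.save(out / f"photo_{width}.webp", "WEBP", quality=75)

if __name__ == "__main__":
    process_upload("upload.png", "renditions")
```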
I feel the need for speed!
Final thoughts
Infrastructure and application scalability are complex and constantly evolving issues, as anyone in the industry will tell you. Every single day we’re finding new ways to optimize, squeeze out a little more performance, and improve service reliability. Since our transition to our new infrastructure we haven’t suffered a major incident of downtime, thanks to careful planning and a strong backup and recovery strategy. I'm really happy with how far we've come with our infrastructure, and I'm excited to think about where we'll be a year from now.

Feel free to tap my shoulder on Twitter, App.net or Google+ if you'd like to talk about any of this in more detail. You can also reach me at evan@ushahidi.com, but I try to avoid my inbox as much as humanly possible.