[Guest Post: John Etherton is the founder and managing partner of Etherton Technologies, Ltd. John is an Ushahidi Trusted Developer. This is a cross-post from his blog.]
I recently had a client that uses Ushahidi to monitor water wells in Afghanistan ask me about increasing the speed of their site. They currently monitor around 3,000 wells, but will soon ramp up the project to monitor all the wells in the country, which would increase the number of wells to more than 100,000. Preliminary testing showed that the out-of-the-box Ushahidi platform, using our Enhanced Map plugin for map rendering, could not load fast enough to provide an acceptable user experience with that much data. My client wanted to be able to zoom and pan the map, as well as filter by category, as quickly as possible; with the default setup these operations were taking as much as several minutes on large data sets. Below I outline the steps I took to fix this. I hope this will help others with similar problems.
But first a quick disclaimer: I do realize that 100,000 isn’t really the definition of large in the computer world these days. Some would say you can’t talk about “large data” until you’re talking billions or trillions, but for the Ushahidi community this is uncharted territory, and feels pretty darn big.
Define the Problem
The first thing to do is understand what the problem is. The site is password protected for security reasons, so the problem isn't handling thousands of hits a day; it's more likely the site will see 10-100 hits a day. The site is used to track broken water wells, but it is also used to look for trends in the performance of the wells. Users will run simple queries when they need to find which wells aren't functioning, but rather complex queries when they are looking for trends. Things like, "Show me all wells that are in province A, B, or C, and have been repaired in the last 30 days, and that were suffering from a drop in the water table." Data for the site is uploaded on a regular basis via Excel files. The site doesn't use crowdsourcing, so it won't be receiving lots of writes to the database. From this we know that write performance isn't very important. If it takes 60 seconds to import 2,000 new wells from an Excel file instead of 10, that's not a big deal: this will only happen once a week, and the users who execute these operations can be trained to wait. What isn't acceptable is taking 60 seconds to load the Ushahidi home page, and then again for each subsequent map manipulation. We know that we're willing to trade write performance for read performance.

We can also deduce that caching could help a little bit with the initial loading of the map page and the lists of wells that need maintenance, but for exploratory queries aimed at revealing trends in the data, caching won't do much good. Nor are we dealing with large numbers of hits, where caching would be a great solution. To solve the problem we're facing, we'll want to ensure that the database is optimized for quick reads and well indexed, and that the functions and algorithms Ushahidi uses to render the map are optimized to spit out data.

For the sake of this article, let me make it clear that each well is represented by an Ushahidi "Incident." For the rest of the article I'll call them incidents to be general.

The Enhanced Map plugin was used to provide more advanced querying on the Ushahidi map than is possible using the default map. With Enhanced Map, users can run queries using more than one category and can choose whether the results from each category should be joined using a boolean ∧ (and) or a boolean ∨ (or). For example, "I want to see all incidents filed under category A or B or C," or "I want to see all incidents filed under category X and Y and Z." Enhanced Map has two versions that accomplish this in different ways. The standard version uses the Reports::fetch_incidents() helper function; the high performance version uses its own SQL to query the database using JOINs instead of nested queries, and JOINs are slightly faster than nested queries in Ushahidi. The advantage of the standard version is that it will play nicely with other plugins that modify the filtering of data via the ushahidi_filter.fetch_incidents_set_params event. This project used the high performance version, since performance was most important.
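To make the difference concrete, here is a simplified sketch of the two query shapes for "incidents in category 6 or 7." These follow the Ushahidi schema, but they are illustrations, not the exact SQL either version generates:

-- Nested-query style (roughly the shape the standard version ends up with):
SELECT incident.id, incident.incident_title
FROM incident
WHERE incident.incident_active = 1
  AND incident.id IN
    (SELECT incident_id FROM incident_category WHERE category_id IN (6, 7));

-- JOIN style (the approach the high performance version takes):
SELECT incident.id, incident.incident_title
FROM incident
JOIN incident_category ON incident_category.incident_id = incident.id
WHERE incident.incident_active = 1
  AND incident_category.category_id IN (6, 7)
GROUP BY incident.id;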
Both versions of Enhanced Map use the same clustering algorithm as the cluster() method in /application/controllers/json.php. The site uses clustering because thousands of dots are harder to look at and make sense of than clusters, especially when zoomed out to country level.

This project uses Ushahidi 2.1. I know a lot has changed since then, but to the best of my knowledge, 2.5 still has the same bottlenecks that I outline, and the principles behind the changes I made should be applicable to any version of Ushahidi to date.

Set Some Baselines
These tests were run locally (i.e. I ran everything off of localhost) on a home-built machine running Ubuntu 11.10, PHP 5.3.6, and MySQL 5.1, with 24GB of RAM, an Intel i7 950 CPU, and a Seagate Barracuda 1TB hard drive. What can I say, I don't like running low on memory. Anyway…

The next thing I did was set up a test instance of the site to test out the changes I made. I made a copy of the production site and then added extra incidents until there were 40,000 in total. With the extra wells it took around 30 seconds to render the home page map and timeline. I set up another test site with 100,000 incidents, but it took over a minute to load the home page, so I stopped using it because it took too long to test things; I figured that if things ran well at 40,000 incidents they would run well at 100,000.

As I said earlier, the site would be used to run some rather complex queries looking for trends. The query below took 7 seconds to run in phpMyAdmin against the database with 40,000 incidents:

SELECT incident.id, location_id, incident_title, incident_date, incident_active,
       province_cat.id as province_cat, location.latitude as lat, location.longitude as lon
FROM `incident` as `incident`
LEFT JOIN `location` as `location` ON `location`.id = `incident`.`location_id`
LEFT JOIN `incident_category` as `province_incident_cat` ON `province_incident_cat`.incident_id = `incident`.`id`
LEFT JOIN `category` as `province_cat` ON `province_cat`.`id` = `province_incident_cat`.`category_id`
LEFT JOIN `incident_category` as `cat0` ON `cat0`.incident_id = `incident`.`id`
LEFT JOIN `incident_category` as `cat1` ON `cat1`.incident_id = `incident`.`id`
WHERE incident.incident_active = 1
  AND incident.incident_date >= '2010-03-01 00:00:00'
  AND incident.incident_date <= '2012-08-31 23:59:59'
  AND `province_cat`.`parent_id` = 5
  AND ( ( cat0.category_id = 6 OR cat0.category_id = 7 OR cat0.category_id = 14 )
        AND ( cat1.category_id = 83 ) )
GROUP BY `incident`.`id`

The fact that it took so long to run in the database showed us that the database itself had some issues. However, the 7 seconds in the database didn't account for the 30+ seconds it took for the page to render. So, we knew that the clustering algorithm and other Enhanced Map/Ushahidi/Kohana/PHP components were also slow.
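If you want to see for yourself where a query like this spends its time, MySQL's EXPLAIN is the obvious first stop. A trimmed-down version of the query above (my own simplification, not the query the site actually runs) is enough to spot the problem:

-- "ALL" in the type column of the output means a full table scan,
-- which is a strong hint that an index is missing.
EXPLAIN
SELECT incident.id, incident.incident_title
FROM incident
LEFT JOIN incident_category as cat0 ON cat0.incident_id = incident.id
WHERE incident.incident_active = 1 AND cat0.category_id = 6
GROUP BY incident.id;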
The Database
The first problem is the lack of indexes in Ushahidi's database. Given how cheap hard drives and memory are these days, the only reason not to index every field that appears in a WHERE clause is if the time it takes to update the index on each insert is too great. That would be a problem if you had lots of inserting going on, which, as discussed above, is not the case for this site. I added the following indexes based on what I knew the site would search on (the statements to create them are sketched after the list):

- category.parent_id – if you want to quickly search based on a parent category
- form_response.incident_id
- incident_category.incident_id and incident_category.category_id – I'm pretty sure these two made the biggest difference for the query I showed above.
- location.latitude and location.longitude – obviously you want these two fields to be searched quickly
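For reference, this is roughly how those indexes are created in MySQL (the index names are my own invention; use whatever naming scheme you like):

CREATE INDEX idx_category_parent_id ON category (parent_id);
CREATE INDEX idx_form_response_incident_id ON form_response (incident_id);
-- These two did the most for the category-filtering query shown earlier.
CREATE INDEX idx_incident_category_incident_id ON incident_category (incident_id);
CREATE INDEX idx_incident_category_category_id ON incident_category (category_id);
CREATE INDEX idx_location_lat ON location (latitude);
CREATE INDEX idx_location_lon ON location (longitude);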
Next, merge the location and incident tables. There's one location per incident, so there's no reason to keep the locations in a separate table. The separate table just creates an unnecessary JOIN every time incidents are looked up.
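One way to perform the merge (a sketch only: the column types shown are illustrative, and any code that joins on the location table would need updating afterwards):

-- Add the coordinate columns to incident, then copy the values across.
ALTER TABLE incident
  ADD COLUMN latitude DOUBLE,
  ADD COLUMN longitude DOUBLE;

UPDATE incident
JOIN location ON location.id = incident.location_id
SET incident.latitude = location.latitude,
    incident.longitude = location.longitude;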
Finally, pull non-essential fields out of the incident table, especially large ones. Since MySQL 5.5, InnoDB tables can be held in memory for super-fast performance, so you want your tables to be small. I'd therefore suggest moving the incident_description field into a separate table. It's been my experience that users rarely search over the description, and as a LONGTEXT field it can take up a lot of space. However, if your site does require frequent searching on the incident_description field, this may not be for you.
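A minimal sketch of that split, assuming a new table I'm calling incident_description_ext (the name is mine, not part of the Ushahidi schema):

CREATE TABLE incident_description_ext (
  incident_id INT UNSIGNED NOT NULL PRIMARY KEY,
  incident_description LONGTEXT
) ENGINE=InnoDB;

-- Copy the descriptions over, then drop the wide column from the hot table.
INSERT INTO incident_description_ext (incident_id, incident_description)
SELECT id, incident_description FROM incident;

ALTER TABLE incident DROP COLUMN incident_description;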