Data cleansing, data cleaning or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, irrelevant, etc. parts of the data and then replacing, modifying, or deleting this dirty data. After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. (Wikipedia)

So I had a Crowdmap full of data sitting there. It had to be cleaned, ready for further analysis by researchers. This blog post describes how I cleared out the irrelevant and corrupt data to produce an accurate database.

After downloading the large CSV file of data, it was obvious from the start that there was an issue with trusted reports. There were 2084 reports with no geolocation or categories, a total mixture of everything. They had come in through every method of entering information on the platform. The reports marked trusted could not be cleaned, as the information had not been reviewed, just approved and verified. As a result, many categories were missing and the majority of reports had not been geolocated.

The main lesson for me is that there has to be a quality control team, as sadly so much of the data was not usable. Having to go through these reports one by one to check them was extremely time consuming. It is really worth remembering to set out criteria before people class reports as trusted.

Following the guidelines that Ushahidi had already published on the Wiki:
I removed all personal information from every report.
I removed all data that could not be geolocated.
I removed all duplicates.

The simplest, least time-consuming way was to use filters in Excel.
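For anyone who would rather script the three steps than filter by hand, here is a minimal sketch in Python using only the standard library. It is not how I did it (I used Excel), and the column names (FIRST NAME, LATITUDE, INCIDENT TITLE and so on) and the sample rows are assumptions made up for illustration, so check them against the header row of your own export. Personal details typed into a description would still need a manual read-through.

```python
import csv
import io

# Assumed column names; adjust to match the header row of your own export.
PERSONAL_COLUMNS = {"FIRST NAME", "LAST NAME", "EMAIL"}
LAT, LON = "LATITUDE", "LONGITUDE"
DEDUP_KEYS = ("INCIDENT TITLE", "DESCRIPTION", LAT, LON)


def has_location(row):
    """Return True when both coordinates are present and parse as numbers."""
    try:
        float(row[LAT])
        float(row[LON])
    except (KeyError, TypeError, ValueError):
        return False
    return True


def clean_reports(rows):
    """Drop rows without coordinates, drop duplicates, strip personal columns."""
    seen = set()
    cleaned = []
    for row in rows:
        if not has_location(row):
            continue
        # Two reports count as duplicates when these fields match, ignoring case.
        key = tuple((row.get(k) or "").strip().lower() for k in DEDUP_KEYS)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({k: v for k, v in row.items() if k not in PERSONAL_COLUMNS})
    return cleaned


# Made-up sample rows, only to show the behaviour.
SAMPLE_CSV = """INCIDENT TITLE,DESCRIPTION,LATITUDE,LONGITUDE,FIRST NAME
Long queue,Voters waiting since 6am,-1.2921,36.8219,Jane
Long queue,Voters waiting since 6am,-1.2921,36.8219,John
No location,Trusted source,,,Ann
"""

cleaned = clean_reports(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(len(cleaned), "report(s) kept")
```

To run it on a real file, open the export with open("reports.csv", newline="", encoding="utf-8") and pass csv.DictReader(f) to clean_reports. The file name here is only a placeholder.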
How many SMS came into the platform: 4372 (taken from information on dashboard)
How many SMS were turned into reports: 17 (taken from clean data)
How many SMS were approved: 17 (taken from clean data)
Verified: 8 (taken from clean data)
Ultimately mapped: 17 (taken from clean data)

The SMS reports ended up as just 17. The reasons for this are:
Not relevant to deployment.
Not enough information to be useful.
Many had no geolocation possible.
Some SMS were just put on the platform as "trusted source" with no information.
Not stated as an SMS when the report was created. (This made it impossible to tell which reports were SMS and which were not, so the figures above do not show how many of the final total of over 1600 clean data reports were actually SMS.)

A point worth remembering: if a report is an SMS, make sure SMS is stated on the report when it is created.

Having 70 categories was also a challenge. This is the list before the data was cleaned:
Geolocated 2339
Trusted Reports 1907
No Need To Translate 1470
Everything Fine 741
Translated 684
Polling Station Logistical Issues 418
Impossible to Geolocate 323
Other 252
Voting Irregularities 201
Threat of violence 178
Unresolved 170
Unverified 149
BVR Issues 142
Voter Integrity Irregularities 139
Civilian Peace Efforts 127
Provisional Citizen Results 110
Voter Register Irregularities 106
Violent Attacks 104
GEOLOCATION 98
IEBC Officials not Acting In Accordance to Set Rules 81
Voter Identification Issues 60
Absence/Insufficient IEBC Officials At Polling Station 55
Irregularities with voter assistance 54
Mobilisation towards violence 51
Missing/Inadequate Voting Materials 49
Counting Irregularities 47
Fear and Tension 45
Rumours 41
TRANSLATION 38
Demonstrations 33
Ballot Box Irregularities 32
Absence/Insufficient law enforcement officials at Polling Station 32
Eviction/population displacement 29
Property Loss/Theft 25
Police Peace Efforts 25
Polling station logisitcal issues 23
Polling Station Closed Before Voting Concluded 22
Campaign material in polling station 21
Ambush 20
Protest over declared results 20
Purchasing of Voters Cards 20
Observers/Media Blocked From Entering Polling Station 19
Hate Speech 18
Resolved 18
POSITIVE EVENTS 17
Irregularities with transportation of ballot boxes 15
Party Agent Irregularities 14
Verified 14
Failure to Announce Results By IEBC official 13
Presence of weapons 13
To Be Geolocated 11
Riots 11
Sexual and Gender Based Violence 8
To Be Translated 8
Armed Clashes 6
Voters Issued Invalid Ballot Papers 5
URGENT 4
Abductions/kidnapping 4
SMS-V 4
Bombings 4
Ballot Boxes Destroyed After Announcing Final Results 3
Purchase of weapons 3
certificate Issues 2
Polling Station Administration 1
Voting Issues 1
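Some of the categories above are near-duplicates of each other, for example "Polling Station Logistical Issues" (418) and "Polling station logisitcal issues" (23). As a rough sketch, and not something I did at the time, Python's standard difflib module can flag category names that look almost the same so a person can decide whether to merge them. The 0.95 threshold is an arbitrary assumption, and only a handful of the categories are included here to keep the example short.

```python
from difflib import SequenceMatcher
from itertools import combinations

# A few category names and counts copied from the list above.
CATEGORY_COUNTS = {
    "Polling Station Logistical Issues": 418,
    "Polling station logisitcal issues": 23,
    "Voting Irregularities": 201,
    "Counting Irregularities": 47,
    "Rumours": 41,
}


def similar_categories(names, threshold=0.95):
    """Return pairs of names whose case-insensitive similarity meets the threshold."""
    pairs = []
    for a, b in combinations(sorted(names), 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


# Print each suspicious pair with the count the merged category would have.
for a, b, score in similar_categories(CATEGORY_COUNTS):
    total = CATEGORY_COUNTS[a] + CATEGORY_COUNTS[b]
    print(f"{a!r} ~ {b!r} (similarity {score}, combined count {total})")
```

A lower threshold catches more typos but also starts pairing categories that are genuinely different, such as "Voting Irregularities" and "Counting Irregularities", so the output is a list of suggestions to review rather than merges to apply automatically.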
After using the filter functions in Excel, I had to go through each line of the CSV file manually, making sure I had not missed anything.

Please, if anything can be learnt from this post-deployment data cleaning, it is that quality control HAS to be in place during deployment. This is key to gaining accurate information.

I gained so much personally from completing this task. I hope it helps you when you are performing a deployment and want to use the results post event. If anyone has any questions, I am always available to answer them.

Happy Mapping Folks,
Jus