Data cleansing, data cleaning or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, irrelevant, etc. parts of the data and then replacing, modifying, or deleting this dirty data.
After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. (Wikipedia)
So I had a Crowdmap full of data sitting there, and it had to be cleaned and made ready for further analysis by researchers.
This post describes how I cleared out the irrelevant and corrupt data to produce an accurate database.
After downloading the large CSV file of data, it was obvious from the start that there was an issue with trusted reports.
There were 2084 reports with no geolocation or categories, a complete mixture of everything. They had come in through every method of entering information on the platform.
The reports marked as trusted could not be cleaned because the information had not been reviewed, only approved and verified. As a result, many categories were missing and the majority of these reports had not been geolocated.
The main lesson for me is that there has to be a quality control team, as sadly so much of the data was not usable.
Having to go through these reports one by one to check them was extremely time consuming. It is really worth remembering to set out criteria before people class reports as trusted.
Following the guidelines that Ushahidi had already published on the Wiki:
I removed all personal information from every report.
Removed all data that could not be geolocated.
Removed all duplicates.
The simplest, least time-consuming way to do this was to use filters in Excel.
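For anyone who would rather script these three steps than filter by hand, here is a minimal sketch in Python. It is not what I did (I used Excel), and the column names (TITLE, DESCRIPTION, LATITUDE, LONGITUDE, PHONE and so on) are assumptions for illustration; a real Crowdmap export may label them differently. The definition of a duplicate used here (same title, description and location) is also just one possible choice.

```python
import csv
import io

# Hypothetical column names; check them against the header of your own export.
PERSONAL_COLUMNS = {"FIRST NAME", "LAST NAME", "EMAIL", "PHONE"}

def clean_reports(rows):
    """Strip personal data, require a location, and drop duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Step 1: remove personal information from the report.
        row = {k: v for k, v in row.items() if k not in PERSONAL_COLUMNS}
        # Step 2: skip reports that could not be geolocated.
        lat = (row.get("LATITUDE") or "").strip()
        lon = (row.get("LONGITUDE") or "").strip()
        if not lat or not lon:
            continue
        # Step 3: skip duplicates (assumed: same title, description, location).
        key = (
            (row.get("TITLE") or "").strip().lower(),
            (row.get("DESCRIPTION") or "").strip().lower(),
            lat,
            lon,
        )
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned

# Made-up sample data standing in for the downloaded CSV file.
sample = io.StringIO(
    "TITLE,DESCRIPTION,LATITUDE,LONGITUDE,PHONE\n"
    "Long queue,Voters waiting since 6am,-1.2921,36.8219,0700000000\n"
    "Long queue,Voters waiting since 6am,-1.2921,36.8219,0711111111\n"
    "No location,Something happened,,,0722222222\n"
)
cleaned = clean_reports(csv.DictReader(sample))
print(len(cleaned), "report(s) kept")
```

With a real export you would pass `csv.DictReader(open("reports.csv", newline="", encoding="utf-8"))` instead of the sample, where `reports.csv` is a placeholder file name.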
How many SMS came into the platform: 4372 (taken from information on the dashboard)
How many SMS were turned into reports: 17 (taken from clean data)
How many SMS were approved: 17 (taken from clean data)
Verified: 8 (taken from clean data)
Ultimately mapped: 17 (taken from clean data)
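To put those figures in proportion, the short calculation below works out the share of incoming SMS that became reports and the share of those reports that were verified. The variable names are only illustrative; the numbers are the ones listed above.

```python
sms_received = 4372  # from the dashboard
sms_reports = 17     # from the clean data
sms_verified = 8     # from the clean data

# Share of incoming SMS that ended up as a report.
conversion_rate = sms_reports / sms_received
# Share of those SMS reports that were verified.
verified_share = sms_verified / sms_reports

print(f"SMS that became reports: {conversion_rate:.2%}")
print(f"SMS reports that were verified: {verified_share:.2%}")
```

That is roughly 0.4% of incoming SMS becoming reports, with just under half of those verified.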
The SMS reports ended up as just 17. The reasons for this are:
Not relevant to the deployment.
Not enough information to be useful.
Many had no geolocation possible.
Some SMS were just put on the platform as "trusted source" with no information.
Not stated as an SMS when the report was created. This made it impossible to tell which reports were SMS and which were not, so the figure of 17 does not show how many of the over 1600 clean reports actually came from SMS.
A point worth remembering: if a report comes from an SMS, make sure SMS is recorded on the report when it is created.
Having 70 categories was also a challenge. This is before the data was cleaned:
Category | Number of reports
Geolocated | 2339
Trusted Reports | 1907
No Need To Translate | 1470
Everything Fine | 741
Translated | 684
Polling Station Logistical Issues | 418
Impossible to Geolocate | 323
Other | 252
Voting Irregularities | 201
Threat of violence | 178
Unresolved | 170
Unverified | 149
BVR Issues | 142
Voter Integrity Irregularities | 139
Civilian Peace Efforts | 127
Provisional Citizen Results | 110
Voter Register Irregularities | 106
Violent Attacks | 104
GEOLOCATION | 98
IEBC Officials not Acting In Accordance to Set Rules | 81
Voter Identification Issues | 60
Absence/Insufficient IEBC Officials At Polling Station | 55
Irregularities with voter assistance | 54
Mobilisation towards violence | 51
Missing/Inadequate Voting Materials | 49
Counting Irregularities | 47
Fear and Tension | 45
Rumours | 41
TRANSLATION | 38
Demonstrations | 33
Ballot Box Irregularities | 32
Absence/Insufficient law enforcement officials at Polling Station | 32
Eviction/population displacement | 29
Property Loss/Theft | 25
Police Peace Efforts | 25
Polling station logisitcal issues | 23
Polling Station Closed Before Voting Concluded | 22
Campaign material in polling station | 21
Ambush | 20
Protest over declared results | 20
Purchasing of Voters Cards | 20
Observers/Media Blocked From Entering Polling Station | 19
Hate Speech | 18
Resolved | 18
POSITIVE EVENTS | 17
Irregularities with transportation of ballot boxes | 15
Party Agent Irregularities | 14
Verified | 14
Failure to Announce Results By IEBC official | 13
Presence of weapons | 13
To Be Geolocated | 11
Riots | 11
Sexual and Gender Based Violence | 8
To Be Translated | 8
Armed Clashes | 6
Voters Issued Invalid Ballot Papers | 5
URGENT | 4
Abductions/kidnapping | 4
SMS-V | 4
Bombings | 4
Ballot Boxes Destroyed After Announcing Final Results | 3
Purchase of weapons | 3
certificate Issues | 2
Polling Station Administration | 1
Voting Issues | 1
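The list above includes what looks like the same category entered twice with different spelling and capitalisation (Polling Station Logistical Issues and Polling station logisitcal issues). With this many categories, pairs like that are easy to miss by eye. The sketch below is one possible way to flag them using Python's standard difflib module. It is not part of the process I followed, the function name and the 0.95 similarity cutoff are assumptions chosen for this example, and any pair it flags still needs a human to decide whether the two categories should really be merged.

```python
import difflib

def near_duplicates(names, cutoff=0.95):
    """Return pairs of names that are almost identical once case is ignored."""
    pairs = []
    for i, name in enumerate(names):
        for other in names[i + 1:]:
            # Similarity ratio between 0 and 1 on the lower-cased names.
            ratio = difflib.SequenceMatcher(
                None, name.lower(), other.lower()
            ).ratio()
            if ratio >= cutoff:
                pairs.append((name, other))
    return pairs

# A handful of category names taken from the list above.
categories = [
    "Polling Station Logistical Issues",
    "Polling station logisitcal issues",
    "Voting Irregularities",
    "Counting Irregularities",
    "Geolocated",
    "GEOLOCATION",
]

for first, second in near_duplicates(categories):
    print(f"Possible duplicate: {first!r} / {second!r}")
```

A lower cutoff catches more typos but also starts pairing categories that are genuinely different, such as Voting Irregularities and Counting Irregularities, so it is worth trying a few values on your own category list.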