The excitement around big data for social good is palpable, and its capacity for change is enormous. To realize that capacity, however, the humanitarian community needs to embrace a fundamental shift in the relationship between data and crisis.
Insights from massive datasets have revolutionized economics, marketing, transportation, and other for-profit industries, but crisis data gathering and analysis remains as fractured and chaotic as the disrupted communities it attempts to serve. Collecting information about a region experiencing disaster, conflict, or other humanitarian emergency is daunting and time-consuming -- even if you're fortunate enough to stumble upon the loose cadre of volunteer gatekeepers trying to manage the mess. Sadly, even as good Samaritans catalogue aid data and incident reports in an increasingly overwhelming spreadsheet diaspora, these static, siloed data quickly fall out of touch with the rapidly changing reality on the ground, and soon become irrelevant.
We can do better.
From Static to Streaming
The traditional view of humanitarian data is a survey collected by an NGO or government aid worker. Scribbled responses are collated, assessed, and aggregated by teams of analysts or academics, and their narrative of the data is presented to decision-makers in DC, London, Geneva, etc. Fortunately, a small digital vanguard is building systems for cataloging or sharing these static datasets, or creating databases populated by reports culled from traditional media, manually collected and verified by teams of trained volunteers. This is the perfect role for a large organization with broad reach, and there are promising early signs that projects like the Humanitarian Data Exchange will slowly begin imposing order on the chaos.
However, the data collected in these types of systems is weeks, months, or even years old by the time it's made publicly available. To put that turnaround time in context, only three weeks passed between the Ukrainian people's ouster of former president Viktor Yanukovych and Russia's annexation of Crimea.
No crisis is in stasis, and therefore the relevance of static data is inversely proportional to its age. If we're interested in using big data technologies for more than elaborate post-mortem assessments, we need to view crisis data as constantly evolving streams of information, prioritized by their proximity to right now.
Whether it's social media, sensor data, crowdsourced or crowd-seeded reporting, or manually maintained databases exposed over web APIs, we need to mandate that crisis data be available immediately and continuously, and build our systems to accommodate that necessity.
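As a concrete (and deliberately hedged) illustration, here's a minimal sketch of what consuming such a stream might look like. The endpoint, query parameters, and record fields below are hypothetical, not a real service:

```python
# Minimal sketch: continuously polling a hypothetical crisis-data API.
# The endpoint, query parameters, and record fields are illustrative only.
import time

import requests

FEED_URL = "https://api.example.org/v1/incidents"  # hypothetical endpoint

def stream_incidents(poll_seconds=30):
    """Yield incident records continuously, resuming from a cursor."""
    cursor = None
    while True:
        params = {"after": cursor} if cursor else {}
        resp = requests.get(FEED_URL, params=params, timeout=10)
        resp.raise_for_status()
        for incident in resp.json().get("data", []):
            cursor = incident["updated_at"]  # advance past records we've seen
            yield incident
        time.sleep(poll_seconds)  # the crisis doesn't pause; neither should the feed

# Downstream consumers treat the feed as an endless iterator:
for incident in stream_incidents():
    print(incident["id"], incident["updated_at"])
```

The point isn't the polling mechanics; it's that consumers see one endless, always-current iterator instead of a stale file dump.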
These streams are inherently volatile, ambiguous, and messy. The top-down, grand-design approach that's effective in cataloging static datasets isn't suited to disparate, heterogeneous waves of information that are only useful when considering the context (time, place, network, etc.) in which their data are published. If a pattern exists in these data, it emerges; it is never imposed.
Therefore the software systems we build need to be equally flexible, resilient, and open. As technologists, it's our job to empower actors local to the crises they're experiencing with unopinionated APIs and data processing pipelines that handle the heavy lifting of real-time data science without the heavy-handed perspective of direct analysis.
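What might that heavy lifting look like without the heavy hand? One sketch of an unopinionated normalization step: heterogeneous records in, a common envelope out, context preserved, no analysis imposed. All field names here are assumptions for illustration:

```python
# Sketch of an unopinionated normalization step: heterogeneous records in,
# a common envelope out, context preserved, no analysis imposed.
# All field names here are assumptions for illustration.

def normalize(record, source):
    """Map a source-specific record onto a shared, context-rich envelope."""
    return {
        "source": source,                        # network context
        "observed_at": record.get("timestamp"),  # temporal context
        "location": record.get("geo"),           # spatial context, e.g. [lon, lat]
        "content": record.get("text") or record.get("body"),
        "raw": record,                           # keep everything; decide nothing
    }

# Two very different sources land in the same envelope:
tweet = {"timestamp": "2014-05-12T09:03:00Z", "geo": [30.52, 50.45], "text": "bridge closed"}
sensor = {"timestamp": "2014-05-12T09:03:05Z", "geo": [30.51, 50.44], "body": "water level 4.2m"}
print(normalize(tweet, "twitter"))
print(normalize(sensor, "sensor-net"))
```

The pipeline's only opinion is that context matters; everything else rides along untouched, leaving interpretation to the people closest to the crisis.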
Free Your Data
While streaming data is the only route to real-time data science, limited Internet access and the time-consuming nature of crowdsourcing adoption and crowd seeding mean that traditional first-person incident reporting still has value -- if only the data were available.
Aggregated analysis is essential to deriving knowledge from even modestly sized datasets. However, organizations typically release that data only in the aggregate, displayed in charts, graphs, and high-level maps via unstructured formats like PDF. This not only stifles innovation, it keeps data locked away from those most likely to benefit from its insights, which is shameful.
It's essential that those of us working with data for good impress upon these agencies the value of disaggregated, incident-level information. Data at this level can be combined, processed, inspected, and understood in concert with billions of other open data points for previously unfathomable insights into the nature of conflict and natural disaster. Accessible data can and will be used in ways we can't foresee, enabling new solutions from innovators brimming with technical capacity but lacking access to first-person reporting from the breadth of crises actively monitored by agencies like the UN, Red Cross, and Amnesty International.
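To make the contrast with aggregate-only releases concrete: once incident-level records are open and structured, a question no pre-baked chart anticipated becomes a short filter. The record shape below is invented for illustration:

```python
# Sketch: with open incident-level records, a question no pre-baked chart
# anticipated becomes a short filter. The record shape is invented.
incidents = [
    {"source": "ngo-a", "when": "2014-03-01", "lat": 44.95, "lon": 34.10, "type": "checkpoint"},
    {"source": "ngo-b", "when": "2014-03-02", "lat": 44.50, "lon": 33.60, "type": "displacement"},
    {"source": "press", "when": "2014-03-02", "lat": 44.95, "lon": 34.11, "type": "checkpoint"},
]

# Cross-organization corroboration: independent reports of the same incident
# type within roughly 0.05 degrees of one another.
matches = [
    (a, b) for a in incidents for b in incidents
    if a["source"] < b["source"] and a["type"] == b["type"]
    and abs(a["lat"] - b["lat"]) < 0.05 and abs(a["lon"] - b["lon"]) < 0.05
]
for a, b in matches:
    print(a["source"], "corroborates", b["source"], "on", a["type"])
```

No single organization's aggregate PDF could surface that corroboration; it only falls out when disaggregated records from different sources can sit side by side.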
Democratized access to incident-level information will empower global technologists to redefine the relationship between data and crisis.
Crisis data is the most important and most urgent information produced by the digital age. People's lives are at stake; it's time to be open, work smarter, and think bigger.
See how we're building the future of crisis data at Crisis.NET.