There is a (probably apocryphal) story that is often told to young quantitative political scientists. During World War II, US Bomber Command was losing dozens of planes on each mission. Every bomber was expensive and every crew member highly trained, so keeping them alive and flying was a top priority. To protect the bombers, the US military decided to add armored plates to the aircraft. But where on the aircraft to put them? Steel is heavy and each plate increases fuel consumption, so you cannot armor every surface: you have to be selective.
To answer the question, Bomber Command brought in a team of accountants. After each mission, the accountants walked the tarmac and examined where each returning bomber had visible damage: wings, tail, nose, and so on. They recorded their observations, and after a few weeks they returned to US Bomber Command and recommended installing the armor where the aircraft had received the most damage. US Bomber Command followed their advice. Over the next few weeks, US bomber losses doubled.
In desperation, US Bomber Command turned to a statistician. "What happened?" they asked. "What did we do wrong?" The statistician smiled slightly and shook his head. "You were examining the right thing, but on the wrong aircraft," he said. "Don't place the armor where you see the most damage; place it where you see the least." Puzzled, they followed the statistician's advice and moved the armored plates. US bomber losses immediately dropped to record lows. So why did the statistician's advice work?
The answer lies in the nature of observation. Unlike the accountants, the statistician was familiar with a phenomenon called the selection effect. A selection effect occurs when we can only observe a non-random slice of what we are studying. That is, we only see a "selection" of cases, not all cases. By walking the tarmac after each mission, the accountants were examining the aircraft that had managed to survive the mission, not all the aircraft that took part. Adding armor where these bombers received the most damage makes sense until you think about the problem in terms of the selection effect: if the surviving aircraft received damage in those areas and still made it back, then that is not where the armor needs to go. Instead, the armor plates need to cover the areas where the surviving bombers received the least damage, because planes that were hit in those areas didn't make it back to base. In other words, where the damage occurred determined which aircraft were observed on the tarmac after the mission and which crashed into the European countryside.
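To make the selection effect concrete, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption rather than historical data: the section names, the per-section lethality probabilities, and the premise that hits land uniformly across the airframe. It shows how counting damage only on returning aircraft systematically undercounts hits to the most vulnerable areas.

```python
import random

random.seed(42)

# Hypothetical aircraft sections and the chance that a hit there downs
# the plane. These probabilities are illustrative assumptions only.
SECTIONS = ["wings", "fuselage", "tail", "engines", "cockpit"]
LETHALITY = {"wings": 0.05, "fuselage": 0.05, "tail": 0.10,
             "engines": 0.60, "cockpit": 0.60}

def fly_mission(num_hits=3):
    """Simulate one sortie: hits land uniformly across sections,
    but survival depends on where they land."""
    hits = [random.choice(SECTIONS) for _ in range(num_hits)]
    survived = all(random.random() > LETHALITY[h] for h in hits)
    return hits, survived

observed = {s: 0 for s in SECTIONS}  # damage counted on the tarmac (survivors only)
actual = {s: 0 for s in SECTIONS}    # damage across the whole fleet

for _ in range(10_000):
    hits, survived = fly_mission()
    for h in hits:
        actual[h] += 1
        if survived:
            observed[h] += 1

for s in SECTIONS:
    print(f"{s:>8}: {actual[s]:>5} hits fleet-wide, "
          f"{observed[s]:>5} visible on survivors")
```

Because hits are drawn uniformly, fleet-wide damage comes out roughly even across sections, yet engine and cockpit hits are scarce among the survivors precisely because those aircraft rarely returned. The scarcity signals danger, not safety, which is exactly what the statistician read in the data.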
Data can tell us much about a population. It can provide insights into characteristics and trends that are impossible to obtain through anecdotal means. But when using data, we must, like US Bomber Command, be aware of what we are not seeing, and why we aren't seeing it. We must think carefully about which segments of the population are invisible in our datasets: which populations lack the money to send in SMS reports; which regions lack the connectivity to use social media; which demographics lack access to mobile phones. These groups are often the hardest to reach using data but, given our goals, often the most important. They do not deserve to be made invisible by our use of data: we have a responsibility to make their voices heard.
There is no easy solution for identifying and reaching these populations; no silver bullet for counting them. Instead, the only solution is to think carefully about what the data overlooks and how we can compensate for it. It is more work, but it is critical to success.