There is no better feeling as a data journalist than getting your hands on some really interesting data. You rush to get it into Tableau or R, wait for trends to reveal themselves, and hit the ground reporting. What a rush!
Wait — aren’t you forgetting something?
There is one crucial step you need to take before you delve into tables, graphs, and charts to start constructing your story. It’s absolutely critical that you pause and ask yourself the most important data-related question of all: how was this data collected?
Two Ways Data is Collected
Knowing how a dataset is collected can provide you with valuable insights:
- What you can — and cannot — do with it
- What questions it will answer reliably
- What questions it may appear to answer but in a way you shouldn’t trust
There are, essentially, two basic ways data can be collected: experimentally or observationally.
The vast majority of data is collected observationally — someone looks at what is happening in a specific situation and records what they see in some form of data. Observational data includes most opinion polls, administrative data (like tax records or crime reports), and program data (like who attended classes at the local YMCA).
To collect data experimentally, you must first set up a controlled situation and then record what occurs. Because the experimental design controls exactly who is included, you can draw stronger conclusions from the results.
For example, an opinion poll based on a truly random sample of people would yield experimental data. But for the sample to be truly random, every person in the population must have an equal chance of being asked, and setting up a survey that way is time-consuming, expensive, and logistically challenging. Our blog post on nonprofits and RCTs goes into more detail about why experimental data is so much less common.
Why Does It Matter?
Observational, experimental — does it really matter how your data was collected?
If your data was collected experimentally, you can safely generalize the results of your analysis to a broader population. If, on the other hand, your data was collected observationally, your results only describe the people in your sample or dataset, not the population at large.
Let’s say I want to know if people where I live want more bike lanes. If I conduct a truly random survey of people in my city — one where every resident has an equal chance of participating — I can safely say that my results are representative of the entire city. If I conduct a survey of random people in the local mall, my survey results can, at best, be representative of… people in the mall.
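To see how much a convenience sample can mislead, here is a toy simulation. All the numbers are invented for illustration: a hypothetical city where cyclists strongly favor bike lanes and drivers mostly do not, surveyed once with a truly random sample and once at a mall where drivers are over-represented.

```python
import random

random.seed(42)

# Hypothetical city: 30% cyclists (80% of whom support bike lanes)
# and 70% drivers (40% of whom support them).
city = ([{"group": "cyclist", "supports": random.random() < 0.8} for _ in range(30_000)]
        + [{"group": "driver", "supports": random.random() < 0.4} for _ in range(70_000)])

def support_rate(sample):
    return sum(p["supports"] for p in sample) / len(sample)

# Truly random sample: every resident has an equal chance of selection.
random_sample = random.sample(city, 1_000)

# Mall sample: suppose drivers are nine times as likely to be at the mall.
mall_pool = ([p for p in city if p["group"] == "driver"] * 9
             + [p for p in city if p["group"] == "cyclist"])
mall_sample = random.sample(mall_pool, 1_000)

print(f"True city support:      {support_rate(city):.1%}")
print(f"Random-sample estimate: {support_rate(random_sample):.1%}")
print(f"Mall-sample estimate:   {support_rate(mall_sample):.1%}")
```

With these made-up proportions, roughly 52% of the city supports bike lanes. The random sample lands close to that figure; the mall sample comes in around ten points lower, because it quietly over-samples drivers.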
Before you lose hope, there are a few ways you can make your observational data more representative:
- Weighting your results (Veracio automatically weights the results of your online surveys)
- Propensity score matching
- Difference-in-differences
- Regression discontinuity design
The important thing is that you know how your data was collected — so you know if these steps are necessary.
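The first of those techniques, weighting, can be sketched in a few lines. This is a minimal post-stratification example with invented numbers: an online sample that over-represents younger respondents is re-weighted so each age group counts in proportion to its (assumed) share of the population.

```python
# Post-stratification: each respondent gets a weight equal to their group's
# share of the population divided by its share of the sample.

# Hypothetical population shares (in practice, from census data).
population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

# A raw online sample that skews young.
sample = (
    [{"age": "18-34", "supports": True}] * 40 + [{"age": "18-34", "supports": False}] * 20
    + [{"age": "35-54", "supports": True}] * 12 + [{"age": "35-54", "supports": False}] * 13
    + [{"age": "55+", "supports": True}] * 5 + [{"age": "55+", "supports": False}] * 10
)

n = len(sample)
sample_share = {g: sum(r["age"] == g for r in sample) / n for g in population_share}
weights = {g: population_share[g] / sample_share[g] for g in population_share}

raw = sum(r["supports"] for r in sample) / n
weighted = (sum(weights[r["age"]] * r["supports"] for r in sample)
            / sum(weights[r["age"]] for r in sample))

print(f"Raw support:      {raw:.1%}")
print(f"Weighted support: {weighted:.1%}")
```

In this made-up sample, raw support is 57%, but after weighting it drops to roughly 48%: the unweighted figure was inflated by the over-sampled younger group. Real pollsters weight on many dimensions at once (age, gender, region, education), but the logic is the same.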
Differences in Data Collection
To illustrate just how many variations there are in how data can be collected — even by well-intentioned and reputable sources — let’s take a look at a couple of different news articles on Americans’ attitudes toward the environment:
This Time story tells readers that Americans don’t care that much about the environment. Gallup, which conducted the survey the Time article was based on, provides the details of its methods: a telephone survey of a random sample of adults from all 50 states, calling landlines and cell phones in equal numbers.
Reuters boldly states that Americans want a strong environmental regulator. Its numbers came from an online Ipsos poll. Ipsos did not conduct a truly random survey; more likely, it surveyed a panel of people who were paid to participate, then used weighting to make the results more representative.
Slate agrees that American voters want a leader who believes in climate change, despite the fact that election results seemed to indicate the contrary. Slate’s story relied heavily on a report published by Yale, which analyzed combined data from six separate random surveys conducted over the previous three years.
The New York Times reports that most Americans believe global warming poses a critical threat, although they don’t think it has caused them any harm personally. The Times conducted telephone interviews with 1,006 randomly selected people across the US, and weighted the responses to ensure a representative sampling of cell phone, landline and dual-phone respondents.
Need Help Collecting Reliable Data?
If you are a journalist or nonprofit struggling to find reliable data — or determine if the data you’ve already collected is saying what you think it is — the team at Datassist is here to help. Our data experts can help you collect, clean, analyze and report your data in a way that is both honest and compelling. Get in touch with us now.