Examining state-level data to understand the effects of police force sizes on crime rates, we notice that regions with larger police forces tend to have higher crime rates. From this, we can conclude that, by cutting police force resources, we can effectively reduce crime rates in any given area, since smaller police forces = lower crime rates.
You’re probably shaking your head at me right now. Obviously, that conclusion makes no sense. But what happens when data leads us to similarly erroneous but less easily discounted conclusions — because we draw connections that just aren’t there?
The importance of understanding prediction vs. causality in data analysis can’t be overstated.
Imagine a world in which we didn’t all immediately dismiss the notion that larger police forces were responsible for higher crime rates. Law enforcement resources in high-crime areas would be reduced based on the findings I advanced earlier, and well, I think we know crime rates wouldn’t go down. Differentiating between predictive and causal relationships is especially critical when using data analysis for social good, since an inaccurate interpretation of your data can actually do more harm than good.
If you plan to undertake a data analysis project for journalism, impact measurement, or really, any other cause, make sure you understand the tools you’re using before you get started.
Prediction vs. Causality: What’s the Difference?
The difference between prediction and causality is a concept you must understand before beginning any data analysis project. So what is the difference?
When two pieces of data have a predictive relationship, their variables are likely to change together. In our example above, regions with a larger police force tended to have a higher crime rate, and vice versa. It doesn’t matter which we consider first — the force resources or the crime rate — the variation is synchronized.
In contrast, when two pieces of data have a causal relationship, one is (at least partly) responsible for the other, and the order in which we examine them is critical. Saying that a large police force causes a high crime rate is very different from saying a high crime rate necessitates a large police force.
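To make the “it doesn’t matter which we consider first” point concrete, here is a minimal sketch in Python. The figures are invented purely for illustration (they are not real police or crime data), and `pearson` is a small helper written for this example. It shows that a correlation comes out the same whichever variable we put first, which is exactly why it cannot tell us which one is driving the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical figures for six regions, invented for illustration only.
officers_per_10k = [12, 15, 19, 22, 27, 31]
crimes_per_10k = [210, 260, 300, 380, 410, 520]

# Swap the order of the two variables: the correlation does not change.
r_police_first = pearson(officers_per_10k, crimes_per_10k)
r_crime_first = pearson(crimes_per_10k, officers_per_10k)

print(f"police first: {r_police_first:.3f}")
print(f"crime first:  {r_crime_first:.3f}")
```

Both numbers are identical. The statistic captures how strongly the two measures move together and nothing about direction, so any causal story has to come from somewhere other than the correlation itself.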
Prediction vs. causality is a fundamental distinction you must understand before you get involved in any data analysis to ensure you don’t mislead your audience — or worse, your own team.
The Dark Side of Data
In her book Weapons of Math Destruction (which I can’t recommend highly enough), Cathy O’Neil warns of the dangers of blindly trusting algorithms and statistics that are used to sort, score, target, and monitor all aspects of our lives. There is great risk in the assumption that statistics and mathematical models are more fair simply because they are free from human bias or discrimination.
Consider a poor student applying for a loan to finance a university education. The algorithms used by financial institutions to determine the risk of default on a loan flag him as high risk: he lives in a low-income area. Because his loan application is denied, the student is unable to afford higher education and is condemned to remain in poverty — after all, without the education, he is unqualified for better-paying jobs.
Prediction vs. causality: which was it here? Will the student’s address cause him to default on a loan, or have we condemned him to a low-income existence by denying him a loan?
Balance Caution With Hope
Of course, just because there is a dark side to applying data to decision-making, that doesn’t mean we should abandon the idea altogether. In their Guide to Solving Social Problems with Machine Learning, Jon Kleinberg, Jens Ludwig and Sendhil Mullainathan suggest that caution applied to the use of statistics and data can be balanced with hope. They emphasize the importance of understanding the relationships within the data we use to recognize when data can be trusted, and when human judgment is necessary to override it.
So how do we know when data can be trusted?
Consider the following questions before applying data analysis to your social project:
- What is the relationship between the data points you’re using: predictive or causal?
- Do you have enough historical data to be confident you’ve correctly identified the relationship?
- Is there a measurable outcome?
- Is there a possibility your data is biased?
Let’s return to our original example of the relationship between the size of a police force and the crime rate in that area. Obviously, we are not going to recommend a reduction in police resources in high-crime areas, but why?
Because prediction and causality are not the same thing. Yes, a large police force predicts high crime, but not because it causes high crime. It’s far more likely that municipalities with higher-than-average crime rates invest more in police resources to handle said crime.
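A small simulation can illustrate how this plays out. The sketch below is entirely hypothetical: the cities, the hiring rule, and the size of the effect are all assumptions made up for the example, not estimates from any real study. We build a world in which crime drives hiring and each additional officer actually reduces crime, and then look at the correlation an analyst would see.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumption: each simulated city has its own underlying level of crime.
underlying_crime = [rng.uniform(100, 500) for _ in range(200)]

# Assumption: cities hire officers in response to that underlying crime.
officers = [0.05 * crime + rng.gauss(0, 2) for crime in underlying_crime]

# Assumption: every officer PREVENTS crime (the true causal effect is negative).
EFFECT_PER_OFFICER = -2.0
observed_crime = [
    crime + EFFECT_PER_OFFICER * force + rng.gauss(0, 10)
    for crime, force in zip(underlying_crime, officers)
]

correlation = pearson(officers, observed_crime)
print(f"true effect per officer: {EFFECT_PER_OFFICER}")
print(f"observed correlation:    {correlation:.2f}")
```

Even though police reduce crime by construction in this made-up world, the observed correlation between force size and crime is strongly positive, because high-crime cities are the ones that hire.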
So what do we do?
Remember, the initial study was conducted using state-level data. In this case, we must conduct a more nuanced, careful analysis that uses data on changes within local police forces and local crime rates, and then combine our findings into a larger model — which would provide insight into causal impacts rather than predictive impacts alone. Educating yourself on how to use data precisely will help you understand where to look for data that will accurately provide the information you’re looking for.
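As a rough sketch of what looking at changes within local police forces can mean, the simulation below compares two slopes: one fitted across all city-years, and one fitted on year-to-year changes inside each city. Everything here is an assumption made for illustration (the cities, the effect size, and especially the idea that year-to-year staffing changes are random). With real data that last assumption would need careful justification, so treat this as a toy demonstration rather than a recipe.

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(1)  # fixed seed so the sketch is repeatable
TRUE_EFFECT = -2.0      # assumed: each extra officer lowers crime by 2 units

level_officers, level_crime = [], []    # raw city-year values
change_officers, change_crime = [], []  # year-to-year changes within a city

for _ in range(100):  # 100 hypothetical cities
    underlying_crime = rng.uniform(100, 500)
    previous = None
    for _ in range(5):  # 5 hypothetical years per city
        # Staffing tracks the city's underlying crime, plus random yearly variation.
        officers = 0.05 * underlying_crime + rng.gauss(0, 3)
        crime = underlying_crime + TRUE_EFFECT * officers + rng.gauss(0, 5)
        level_officers.append(officers)
        level_crime.append(crime)
        if previous is not None:
            # Differencing removes the city's fixed underlying crime level.
            change_officers.append(officers - previous[0])
            change_crime.append(crime - previous[1])
        previous = (officers, crime)

across_cities_slope = slope(level_officers, level_crime)
within_city_slope = slope(change_officers, change_crime)

print(f"slope across cities:     {across_cities_slope:.2f}")
print(f"slope on within changes: {within_city_slope:.2f}")
```

In this simulated world the slope across cities is positive, while the slope on within-city changes comes out negative and close to the assumed effect. The comparison across cities mostly picks up which cities had more crime to begin with; the within-city comparison strips that out.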
Make Meaningful Change With Your Data
For those interested in more in-depth instruction on data exploration and storytelling, I am co-hosting a free online data journalism course right now with the amazing Alberto Cairo. It’s not too late to sign up!
If you’re struggling to tell your story with data analysis, Datassist is here to help. Our statisticians and data visualization specialists are proud to support meaningful change around the globe by helping organizations of all sizes collect and analyze valuable data and transform it into educational, accessible stories that captivate their audiences. Get in touch with us today to learn more.