In last week’s blog post, I talked about smoking and life expectancy, and how they’re related. We looked at graphs that seem to indicate that smoking a certain number of cigarettes each day could actually extend your life expectancy.
That is very clearly not true (and a rather silly suggestion at that). But the confusion around causality wouldn’t be an issue if it were always easy to recognize the different relationships in our data. Since I started highlighting the importance of using data correctly last week with a post on the ecological fallacy, I thought we might as well make this a little refresher course.
What is False Causality?
First things first. You’ve heard the saying correlation does not equal causation? Of course you have. It’s been said by pretty much every statistician ever (including me, on this blog, at least four times in previous posts). That expression is about the false causality fallacy. Just because we can identify a pattern between two variables doesn’t mean we can assume that one is causing the other.
- Ice cream purchases do not cause shark attacks
- Nicolas Cage films do not prompt pool drownings
- Wearing sunscreen does not cause cancer
- Large police forces do not incentivize crime
Common wisdom dictates that to avoid issues with causality, we must control for more variables. Tease out the lurking variables, the confounders, the hidden factors. And sometimes, that’s true. But (you knew there was a “but” coming) often it’s not. Including third, fourth (and so on) variables in your model can change the question that you’re asking — and the one you’re answering.
How Does it Happen?
There are different kinds of data relationships. And, not unlike human relationships, they can be pretty complicated. While we may think we understand the situation after a quick look, there’s often a lot more going on than we can perceive from the surface. When examining patterns in your data, it’s crucial to ask yourself what’s really going on.
Tools for Prediction Don’t Work Well for Causality
Here’s an example. Netflix uses algorithms to predict what movie you might want to watch next. They include as many variables as possible in the model. Why does this work? Because they don’t care why you’re going to watch the movie, they just want to predict whether or not you are, so they can suggest movies that will keep you watching. It’s about prediction, not causality.
Adding a Third Variable Changes the Question
When you add a third variable (Z) to the mix, there are at least two ways it can relate to your relationship of interest (X → Y). Z might be a confounder, a variable that influences both X and Y, or a mediator, a variable through which X affects Y.
If you’re making a prediction, it doesn’t matter whether Z is a confounder or a mediator; you can include it in your model either way. However, if you’re trying to figure out what causes the relationship between X and Y, things get trickier. You should include Z if it’s a confounder, and you should not include it if it’s a mediator. Confusing? Let’s look at an example.
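To see why controlling for a confounder matters, here’s a minimal simulated sketch (my own illustration, not data from the post): Z drives both X and Y, and X has no real effect on Y at all. A naive regression of Y on X finds a strong “effect”; controlling for Z makes it vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Z is a confounder: it drives both X and Y.
# X has NO direct effect on Y in this simulation.
z = rng.normal(size=n)
x = 2 * z + rng.normal(size=n)
y = 3 * z + rng.normal(size=n)

def coef_on_x(*extra_controls):
    """Least-squares coefficient on x, optionally controlling for other variables."""
    design = np.column_stack((x,) + extra_controls + (np.ones(n),))
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0]

print(coef_on_x())   # roughly 1.2: a spurious "effect" of X on Y
print(coef_on_x(z))  # roughly 0: controlling for the confounder removes it
```

The same arithmetic that uncovers the truth here would hide it if Z were a mediator instead, which is exactly why you have to decide which one Z is before you build the model.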
Could There Be Another Factor?
Is there another factor you can’t see that would change, or even eliminate, the relationship?
A while back, Datassist provided some analysis for a project measuring the impact of cash transfers to young pregnant mothers. The project’s goal was to decrease the prevalence of low birth weight babies. Here, X is the cash transfer and Y is the probability of a low birth weight baby. Is X causing Y? Does the cash transfer program reduce the chances that a baby will be born with a low birth weight?
Some of these women used the cash to buy and consume different types and amounts of food. So nutrition is the Z variable. Is it a confounder or mediator? You need to decide before you build your model.
It Depends on Your Question
There is no mathematical test to decide if Z is a confounder or a mediator. It all comes down to what question you want to answer. Take a look at our models below.
The green line is what our results look like with nutrition in the model. It’s us “controlling” for nutrition. So what question are we answering then?
If a mother’s nutrition does not change over the course of the project, how does a cash transfer affect her chances of delivering a low birthweight baby?
The yellow line is what our results look like if we assume nutrition is a mediator and don’t control for it. What question does that answer?
How does a cash transfer affect her chances of delivering a low birth weight baby, including through any changes in her nutrition?
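The contrast between the two questions can be sketched in code. The numbers below are made up for illustration, assuming nutrition is a pure mediator (cash improves nutrition, and only nutrition raises birth weight); they are not the project’s real data. Leaving nutrition out of the model gives the total effect of the transfer, including the part that flows through nutrition (the yellow-line question), while controlling for nutrition isolates the direct effect with nutrition held fixed (the green-line question).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical mediator chain: cash -> nutrition -> birth weight.
cash = rng.binomial(1, 0.5, size=n)                  # 1 = mother received the transfer
nutrition = 0.8 * cash + rng.normal(size=n)          # cash improves nutrition
birth_weight = 0.5 * nutrition + rng.normal(size=n)  # only nutrition affects weight here

def effect_of_cash(controls=()):
    """Least-squares coefficient on cash, optionally controlling for other variables."""
    design = np.column_stack((cash,) + controls + (np.ones(n),))
    beta, *_ = np.linalg.lstsq(design, birth_weight, rcond=None)
    return beta[0]

print(effect_of_cash())              # roughly 0.4: the total effect, flowing through nutrition
print(effect_of_cash((nutrition,)))  # roughly 0: the direct effect, with nutrition held fixed
```

Both regressions are computed correctly; they simply answer different questions, and only your research question tells you which one you want.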
How to Avoid False Causality
Short answer: very carefully. As we rely increasingly on AI and algorithms to do the heavy lifting with our data, it’s very important that we understand what’s happening within those systems.
Here are some key points that you must keep in mind to avoid falling victim to the false causality fallacy.
- Data visualizations using the same variables can look very different, depending on the data relationship that is being visualized.
- Algorithms designed to analyze data are often good at developing predictive models (estimating what might happen next). They are not so good at developing causal models (determining why something happened in the first place).
- It’s important to be very deliberate about which extra variables you include in your models. You can’t automatically exclude them, but you also can’t assume you know which ones belong. Use statistical reasoning to make sure that all the data that should be included is.
- Two heads are better than one. If you’re worried about falling into the trap of false causality, ask the experts at Datassist for help.