
In last week’s blog post, I talked about smoking and life expectancy, and how they’re related. We looked at graphs that seem to indicate that smoking a certain number of cigarettes each day could actually extend your life expectancy.

That is very clearly not true (and a rather silly suggestion at that). But the confusion around causality wouldn’t be an issue if it was always incredibly easy to recognize differences in our data relationships. Since I started highlighting the importance of using data correctly last week with a post on the ecological fallacy, I thought we might as well make this a little refresher course.

What is False Causality?

First things first. You’ve heard the saying “correlation does not equal causation”? Of course you have. It’s been said by pretty much every statistician ever (including me, on this blog, at least four times in previous posts). That expression is about the false causality fallacy. Just because we can identify a pattern between two variables doesn’t mean we can assume that one is causing the other.

Common wisdom dictates that to avoid issues with causality, we must control for more variables. Tease out the lurking variables, the confounders, the hidden factors. And sometimes, that’s true. But (you knew there was a “but” coming) often it’s not. Including third, fourth (and so on) variables in your model can change the question that you’re asking — and the one you’re answering.

How Does it Happen?

There are different kinds of data relationships. And, not unlike human relationships, they can be pretty complicated. While we may think we understand the situation after a quick look, there’s often a lot more going on than we can perceive from the surface. When examining patterns in your data, it’s crucial to ask yourself what’s really going on.

Tools for Prediction Don’t Work Well for Causality

Here’s an example. Netflix uses algorithms to predict what movie you might want to watch next. They include as many variables as possible in the model. Why does this work? Because they don’t care why you’re going to watch the movie, they just want to predict whether or not you are, so they can suggest movies that will keep you watching. It’s about prediction, not causality.

Predictive models can include all variables.
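To make the prediction-first mindset concrete, here is a minimal sketch in Python. The data and feature names are invented for illustration (this is not Netflix’s actual system): every available variable goes into the model, and the only thing we judge is out-of-sample accuracy.

```python
# A minimal sketch of the prediction-first mindset: throw every available
# feature into a predictive model and judge it purely on predictive accuracy.
# All data and feature names below are synthetic and purely illustrative.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 5_000
df = pd.DataFrame({
    "hours_watched_last_week": rng.exponential(5, n),
    "liked_similar_titles": rng.integers(0, 2, n),
    "time_of_day": rng.integers(0, 24, n),
    "device_is_tv": rng.integers(0, 2, n),
})

# Synthetic outcome: did the user watch the recommended title?
logit = (0.2 * df["hours_watched_last_week"]
         + 1.5 * df["liked_similar_titles"]
         - 0.03 * df["time_of_day"]
         + 0.5 * df["device_is_tv"] - 2.0)
df["watched"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="watched"), df["watched"], random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
# For prediction, it does not matter WHY these features relate to watching,
# only that they improve accuracy on data the model has not seen.
```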

Adding a Third Variable Changes the Question

When you add a third variable to the relationship, you introduce at least two possible relationships that this third variable (Z) has to your relationship of interest (X → Y).

If you’re making a prediction, it doesn’t matter if Z is a confounder or a mediator. You can include it in your model. However, if you’re trying to figure out what causes the relationship between X and Y, things get trickier. You should include Z if it’s a confounder, and you should not if it’s a mediator.  Confusing? Let’s look at an example.

When determining causality, it’s important to know if additional variables are mediators or confounders.
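A small simulation makes the distinction tangible. The numbers below are invented purely for illustration: in one world Z confounds the X → Y relationship, so controlling for it removes the bias; in the other world Z mediates it, so controlling for it wipes out the very effect we want to measure.

```python
# Simulated illustration (made-up coefficients): the same "control for Z"
# step helps when Z is a confounder and hurts when Z is a mediator.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100_000

def coef_of_x(X, y):
    """Return the fitted coefficient on X (the first column)."""
    return LinearRegression().fit(X, y).coef_[0]

# World 1: Z is a CONFOUNDER. Z drives both X and Y; X has no effect on Y.
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = 0.8 * z + rng.normal(size=n)
print("Confounder, Z omitted :", round(coef_of_x(x[:, None], y), 2))               # ~0.39 (spurious)
print("Confounder, Z included:", round(coef_of_x(np.column_stack([x, z]), y), 2))  # ~0.00 (correct)

# World 2: Z is a MEDIATOR. X drives Z, and Z drives Y; the effect flows through Z.
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)
y = 0.8 * z + rng.normal(size=n)
print("Mediator, Z omitted   :", round(coef_of_x(x[:, None], y), 2))               # ~0.64 (total effect)
print("Mediator, Z included  :", round(coef_of_x(np.column_stack([x, z]), y), 2))  # ~0.00 (effect controlled away)
```

Same arithmetic, opposite conclusions: the right move depends entirely on which world you believe you are in, and that is a modelling decision, not a statistical test.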

Could There Be Another Factor?

Is there another factor you don’t see that could change, or even eliminate, the relationship?

A while back, Datassist provided some analysis for a project measuring the impact of cash transfers to young pregnant mothers. The project’s goal was to decrease the prevalence of low birthweight babies. The X here is the cash transfer, and the Y is the probability of a low birthweight baby. Is X causing Y? Does the cash transfer system reduce the chances that the baby will have a low birth weight?

Could cash transfers to poor mothers decrease the likelihood they would deliver low birthweight babies?

Some of these women used the cash to buy and consume different types and amounts of food.  So nutrition is the Z variable. Is it a confounder or mediator? You need to decide before you build your model.  

How does nutrition affect the relationship between cash transfers for poor mothers and low birthweight babies?

It Depends on Your Question

There is no mathematical test to decide if Z is a confounder or a mediator.  It all comes down to what question you want to answer. Take a look at our models below.

The green line shows results when we control for nutrition; the yellow line shows results when we don’t.

The green line is what our results look like with nutrition in the model. It’s us “controlling” for nutrition. So what question are we answering then?

If a mother’s nutrition does not change over the course of the project, how does a cash transfer affect her chances of delivering a low birthweight baby?

The yellow line is what our results look like if we assume nutrition is a mediator and don’t control for it. What question does that answer?

How does a cash transfer affect a mother’s chances of delivering a low birthweight baby, including any effect that works through changes in her nutrition?
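For readers who want to see the mechanics, here is a hedged sketch of the two models on simulated data. The data-generating story and every number below are invented for illustration, not taken from the actual project: we assume the transfer improves nutrition and nutrition lowers the risk of a low birthweight baby, so nutrition acts as a mediator.

```python
# Hedged sketch of the two models in the cash-transfer example, on simulated
# data. All coefficients are invented; they only illustrate the mechanics.
# Assumed story: transfer -> better nutrition -> lower risk (nutrition mediates).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 20_000
cash_transfer = rng.integers(0, 2, n)                  # X: received the transfer?
nutrition = 0.6 * cash_transfer + rng.normal(size=n)   # Z: nutrition score
risk = -0.9 * nutrition + rng.normal(size=n)           # latent risk of low birthweight
low_birthweight = (risk > 0.5).astype(int)             # Y: low birthweight baby

# "Yellow line" model: nutrition NOT controlled -> total effect of the transfer,
# including the part that works through improved nutrition.
m_total = sm.Logit(low_birthweight, sm.add_constant(cash_transfer)).fit(disp=0)

# "Green line" model: nutrition controlled -> effect of the transfer for a mother
# whose nutrition stays fixed, which removes the mediated pathway.
X_ctrl = sm.add_constant(np.column_stack([cash_transfer, nutrition]))
m_ctrl = sm.Logit(low_birthweight, X_ctrl).fit(disp=0)

print("Transfer coefficient, nutrition omitted   :", round(m_total.params[1], 2))
print("Transfer coefficient, nutrition controlled:", round(m_ctrl.params[1], 2))
```

Leaving nutrition out recovers the total effect of the transfer (the yellow line’s question); adding it isolates what’s left once nutrition is held fixed (the green line’s question). Neither model is wrong; they simply answer different questions.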

How to Avoid False Causality

Short answer: very carefully. As we rely increasingly on AI and algorithms to do the heavy lifting with our data, it’s important that we understand what’s happening within those systems.

Understanding causality is critical in an age of machine learning.

Here are some key points that you must keep in mind to avoid falling victim to the false causality fallacy.

  • Data visualizations using the same variables can look very different, depending on the data relationship that is being visualized.
  • Algorithms designed to analyze data are often good at developing predictive models (estimating what might happen next). They are not so good at developing causal models (determining why something happened in the first place).
  • It’s important to be very deliberate when including extra variables in your models. You can’t automatically exclude any of them, but you also can’t assume you know which ones to include. Use statistical reasoning to make sure every variable that belongs in the model is there.
  • Two heads are better than one. If you’re worried about falling into the trap of false causality, ask the experts at Datassist for help.