Data viz without statistical literacy is a problem. Playing with data in order to learn the best practices in design of data visualization is great, but putting data viz out in public that looks good but lacks statistical literacy is just as bad as poorly designed data images. Arguably worse. Data visualization is a powerful tool and people will believe false ‘knowledge’ and spurious relationships when they are beautifully displayed with, at best, a small written caveat that “correlation is not causation”. No amount of design principles will correct for this.
It is essential to teach statistical literacy along with data visualization techniques. At the very least we have to truly emphasize to design students the need to partner with someone who has statistical literacy.
I am a big fan of data visualization. I believe that it is the powerful method of communication that it’s claimed to be. Communicating the results of data analysis in a visual way speaks directly to a part of the brain that many people like spending time with. So when I say that data image design is effective, I also mean that it’s important, and even beautiful.
But I have a concern about how data visualization is taught in some places.
When teaching data visualization, it is essential to teach data literacy along with the design skills. Teachers need to emphasize that if you don’t have the time or resources to work with data on a deep level, you should not make your data visualizations public.
Learning design skills and principles of data visualization is great, as is getting up to speed with interactive data design using D3, Tableau, and R. But knowing how to use those tools and understanding the principles of data design does not prepare you to create data visuals unless you also have a high level of data literacy. You’re playing with fire (but you don’t know how to properly analyze that fire).
Data viz can sometimes be thought of as a way to “let the data speak”, but if you’re not using proper statistical thinking, you are telling the data what to say. For example, visualizing two things in relation to each other without any statistical exploration is a very bad practice. When you look at two raw measures in relation to each other, the problem is much bigger than the old “correlation is not causation” issue. Without statistics, there is no way to tell whether these items are even correlated. There is no way to tell whether the relationship is meaningful or is simply due to chance, to the way the data was collected, or to any number of other complex factors.
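To make the "due to chance" point concrete, here is a minimal sketch of one common check, a Pearson chi-square test of independence on a 2x2 table of counts. The counts are invented for illustration, the function name `chi_square_2x2` is hypothetical, and the test ignores how the data was sampled, so treat it as a starting point rather than a full analysis.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    `table` is [[a, b], [c, d]]: rows are groups, columns are outcomes.
    Returns (statistic, p_value) for 1 degree of freedom,
    without a continuity correction.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect in this cell if group and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# HYPOTHETICAL counts, invented for illustration: [with issue, without issue] per group.
small_sample = [[12, 28], [8, 32]]        # 30% vs 20%, 40 youth per group
large_sample = [[120, 280], [80, 320]]    # same 30% vs 20%, 400 youth per group

chi_small, p_small = chi_square_2x2(small_sample)
chi_large, p_large = chi_square_2x2(large_sample)
print(f"small sample: chi2={chi_small:.2f}, p={p_small:.3f}")
print(f"large sample: chi2={chi_large:.2f}, p={p_large:.4f}")
```

Both tables would produce the same bar chart of rates, yet only the larger sample gives much evidence that the gap is more than chance. The chart alone cannot show that difference.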
Let’s look at an example.
I work in a community that is concerned with a rising rate of youth mental health issues. The local social service agencies start to look at who’s at risk so they can do some preventative work. We have a database that collects data on the local youth; some of their individual characteristics as well as whether or not they have had a mental health issue in the past 12 months.
If I believe that this mental health situation is related to gender, I can make the data say that. I created the following chart by breaking the “At-Risk” data down by gender.
From this chart, we might think that males are at a higher risk for mental health issues.
If I believe that this mental health situation is related to immigration, I can make the data say that. I created the following chart by breaking the “At-Risk” data down by the immigration status of the young people in my community. Again, I am not letting the data speak by breaking it out in these ways; I am telling it what to say.
From this chart, we might think that immigrant youth are at a higher risk for mental health issues.
Lastly, if I believe that this mental health situation is related to poverty, I can make the data say that. I created the following chart by breaking the “At-Risk” data down by the poverty level of the young people in my community. As with the two previous charts, I can make it look like whatever social indicator I’m interested in is affecting youth mental health. By simply displaying the raw data, I have no way of knowing if the differences I’m displaying are actually meaningful.
From this chart, we might think that youth not in poverty are at a higher risk for mental health issues.
However, instead of breaking the data out one variable at a time and “allowing the data to speak” through data visualization, we can build a statistical model. This model will look at how gender, immigration and poverty work together. Using this model, we can account for the way the data was sampled, and we can account for the possibility that more males are living in poverty. We can then establish whether the differences we’re seeing in our data viz are real differences or artifacts of the assumptions we brought to the research.
Here is the chart of the results of the statistical model.
Male non-immigrants in poverty are at the highest risk, and female immigrants in poverty are at the second-highest risk.
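The post does not say which model was used, so the following is only a sketch of one common choice: a main-effects logistic regression, fitted here by plain gradient ascent so that it runs without extra libraries. The counts are invented for illustration and are not the community data. The names `cells`, `marginal_rate` and `fit_logistic` are hypothetical. The sketch also leaves out interactions, sampling weights and uncertainty estimates, all of which a real analysis would need.

```python
import math

# HYPOTHETICAL counts, invented for illustration (not the community data).
# Each row: (male, immigrant, in_poverty, n_youth, n_with_issue)
cells = [
    (1, 0, 0, 200, 60),
    (1, 1, 0, 200, 60),
    (1, 0, 1,  50, 20),
    (1, 1, 1,  50, 20),
    (0, 0, 0,  50,  5),
    (0, 1, 0,  50,  5),
    (0, 0, 1, 200, 30),
    (0, 1, 1, 200, 30),
]

def marginal_rate(cells, column, value):
    """Raw rate of issues among youth whose `column` equals `value`.

    This is what a one-variable-at-a-time bar chart displays.
    """
    n = sum(c[3] for c in cells if c[column] == value)
    k = sum(c[4] for c in cells if c[column] == value)
    return k / n

def fit_logistic(cells, steps=20000, lr=0.5):
    """Fit intercept + three main effects by gradient ascent
    on the mean binomial log-likelihood of the grouped counts."""
    total = sum(c[3] for c in cells)
    beta = [0.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0] * 4
        for male, immigrant, poverty, n, k in cells:
            x = (1.0, male, immigrant, poverty)
            z = sum(b * xi for b, xi in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(4):
                # Score contribution: (observed cases - expected cases) * feature.
                grad[j] += (k - n * p) * x[j]
        beta = [b + lr * g / total for b, g in zip(beta, grad)]
    return beta

rate_not_poor = marginal_rate(cells, 2, 0)
rate_poor = marginal_rate(cells, 2, 1)
intercept, b_male, b_immigrant, b_poverty = fit_logistic(cells)

print(f"raw rate, not in poverty: {rate_not_poor:.2f}")
print(f"raw rate, in poverty:     {rate_poor:.2f}")
print(f"model coefficients (log-odds): male={b_male:.2f}, "
      f"immigrant={b_immigrant:.2f}, poverty={b_poverty:.2f}")
```

In these invented counts, the raw breakdown shows a higher rate among youth not in poverty, while the fitted poverty coefficient is positive once gender is in the model, because most of the youth not in poverty here are male. That is the kind of reversal a one-variable chart cannot reveal.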
If we look at variables one at a time – telling the data what we want it to say – telling the data that gender is important, or that poverty is important – we would say that immigrants are most at risk. Or that males are most at risk. Or that young people not living in poverty are the most at risk.
But if we actually let the data speak – we can see that males who are not immigrants who are living in poverty are the most at risk. This is a very different – and much more representative – story.
This is only one example. Sometimes the differences are larger or smaller when you apply statistical thinking to data as part of visualizing the data. You will not be able to tell which situation you are in until you test it. You cannot tell if your beautiful data viz is illuminating a true pattern or highlighting a false assumption that you are unaware of – until you apply statistical thinking to the data.
This is not to say that you shouldn’t work in the area of data viz. I think you should. But I think there needs to be a much stronger emphasis on statistical exploration and data literacy. Perhaps most importantly, data viz created by practicing on unexplored data should not be publicly displayed as illustrating meaningful results in any way.