I spend a lot of my time educating people on how to use data. I try to teach them to use it to engage their audience while ensuring the story they tell is honest and transparent. One of the most common questions I get from students in my classes is how to deal with missing datasets.
At first blush, that seems like a totally reasonable question, and it is. How do you work with numbers when some of the numbers you need aren’t there?
But looking at the issue more closely can prompt new questions. Why aren’t those numbers there? Is it significant? What are missing datasets really? Has the data — accidentally or deliberately — not been collected? Or does it exist, but you just haven’t found it yet?
Exploring Missing Datasets
Some people are diving deep into questions about missing datasets and their significance. Mimi Onuoha is an artist and researcher who has taken the exploration of missing datasets to a new level. She examines the various ways in which we abstract, represent, and classify people with data — and what it means when some people don’t show up in our data.
Her project “On Missing Datasets” is fascinating:
“Missing datasets” are my term for the blank spots that exist in spaces that are otherwise data-saturated… Unsurprisingly, this lack of data typically correlates with issues affecting those who are most vulnerable in that context.
Onuoha’s preoccupation with missing datasets goes beyond the technical issues of how to deal with gaps in your data to consider why those gaps exist in the first place. We’ve talked a little about this before; the threat posed by data deserts is all too real. In her project, Onuoha includes an ever-evolving list of missing datasets — data that is simply not collected — including:
- Civilians killed in encounters with police or law enforcement agencies
- Trans people killed or injured in instances of hate crime
- Poverty and employment statistics that include people who are behind bars
- Muslim mosques/communities surveilled by the FBI/CIA
- Mobility for older adults with physical disabilities or cognitive impairments
- Undocumented immigrants currently incarcerated and/or underpaid
- True measures around how often sexual harassment happens in the workplace
- Caucasian children adopted by parents of color
Certainly, one could make an argument for the importance of gathering data on any of these subjects. Which leads us to ask: why isn’t that data being gathered?
The Data Collection Process
“If you haven’t considered the collection process, you haven’t considered the data.”
As Onuoha observes in her article “The Point of Collection”, technical, practical, and ethical issues with data begin at the point of collection. Missing datasets are rarely statistics that have been gathered and then misplaced, and only occasionally data we simply haven’t gotten around to collecting yet. She cites four key reasons for the existence of missing datasets.
- Those with the ability to collect the data choose not to
This is most common when the missing datasets would likely show that the people in power hold an unfair advantage or are abusing their position.
- The data we want doesn’t fit our mode of collection
Sometimes, the data we want simply resists quantification — we lack a reliable scale on which to measure our subject.
- Collection appears to be more work than it’s worth
Certain types of data are challenging to collect. If providing data proves more painful than rewarding, data owners may hesitate to supply it.
- The data is to be collected from people who don’t wish to be identified
In some circumstances, supplying data would strip away the protection that anonymity offers, which again makes data owners reluctant to share it.
Missing datasets are not simply mistakes or random gaps. Often the data we choose not to collect can tell a story as meaningful as one from the data we gather easily.
“The point of data collection is a unique site for unpacking change, abuse, unfairness, bias, and potential. We can’t talk about responsible data without talking about the moment when data becomes data.”
~Mimi Onuoha, The Point of Collection
Are You Searching for Missing Datasets?
Sometimes the data that isn’t there can mean just as much as the data that is. Creating a data biography is a crucial step in uncovering bias in how and why your data was collected. Applying statistical reasoning can also help keep your data honest.
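One simple piece of statistical reasoning is to check whether the gaps in a dataset you already hold are spread evenly or cluster in one group. The sketch below is only an illustration, not a method taken from Onuoha’s work: the function name `missing_rate_by_group`, the field names, and the survey records are all hypothetical and invented for this example.

```python
from collections import defaultdict


def missing_rate_by_group(rows, group_key, value_key):
    """Return {group: share of rows in that group where value_key is None}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        # Treat an absent key and an explicit None the same way: as a gap.
        if row.get(value_key) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical survey records, made up purely for illustration.
rows = [
    {"region": "north", "income": 41000},
    {"region": "north", "income": 38000},
    {"region": "north", "income": None},
    {"region": "north", "income": 52000},
    {"region": "south", "income": None},
    {"region": "south", "income": None},
    {"region": "south", "income": 29000},
    {"region": "south", "income": None},
]

rates = missing_rate_by_group(rows, "region", "income")
for region, rate in sorted(rates.items()):
    print(f"{region}: {rate:.0%} of income values missing")
```

In this made-up example, a quarter of the income values are missing in one region and three quarters in the other. A lopsided pattern like that does not explain itself, but it is a cue to go back to the collection process and ask why one group is so much harder to see.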
Are you struggling with missing datasets? Would you like to fill the gaps in your data — or tell a story about why they exist? The experts at Datassist can help. We work with journalists, nonprofits and government organizations to help tell honest, accurate, and engaging data stories. Get in touch with us to discuss your project.