If you’re anything like me, you don’t like cleaning. Going out to a park or an event is fun, and going to work feels productive. Cleaning my house is neither. And making the leap over to the data world doesn’t make it any more exciting. Analysis and visualization offer plenty of intriguing trends and exciting stories, but data cleaning? Who cares?
Why Data Cleaning Matters
We collect data to make informed, evidence-based decisions. It doesn’t take an expert to see that if your data is incorrect, analyzing it will lead to inaccurate conclusions.
Take a look at that dataset you’ve been tinkering with:
- Can you accurately interpret the data?
- Are there typos or multiple formats?
- Are there missing or incorrect values?
Resolving data quality problems (which is a fancy way of saying data cleaning) is often where the lion’s share of time and resources is spent. Data cleaning helps to:
- Ensure proper process procedures
- Focus on analysis and relationships
- Find and repair anomalies
There are many types of data: descriptive, longitudinal, streaming, web (scraped), text, numeric, and so on. Each can come with its own quality issues related to the data gathering process and end use.
How Can I Minimize the Cost?
As I’ve pointed out throughout this series on surveys, planning ahead — from the research question through data collection stages — will really pay off when you find yourself at the data cleaning stage.
If your field team uses a template during collection, initial data cleaning actually occurs as data is entered, because coders are prompted as they enter the data. A template will help limit:
- Inaccuracies and typos
- Inability to account for what is important
- Incomplete data
- Vague results that offer no clear guidance
By preparing early (read: before you start collecting data) you can significantly reduce the time and resources required for data cleaning later on.
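To make the template idea concrete, here is a minimal sketch of how an entry form might check each record as it is typed in. The field names, allowed values, and ranges are all made up for illustration; your own template would reflect your research question.

```python
# Hypothetical survey template: every field name and rule here is illustrative.
TEMPLATE = {
    "respondent_id": {"required": True, "type": int},
    "age": {"required": True, "type": int, "min": 0, "max": 120},
    "region": {"required": True, "type": str,
               "choices": {"north", "south", "east", "west"}},
    "comments": {"required": False, "type": str},
}


def validate_entry(record, template=TEMPLATE):
    """Return a list of problems in one record (an empty list means it passes)."""
    problems = []
    for field, rule in template.items():
        value = record.get(field)
        # Incomplete data: a required field was left blank.
        if value is None or value == "":
            if rule["required"]:
                problems.append(f"{field}: missing required value")
            continue
        # Wrong kind of value, e.g. text typed into a numeric field.
        if not isinstance(value, rule["type"]):
            problems.append(f"{field}: expected {rule['type'].__name__}")
            continue
        # Out-of-range numbers and typos in fixed-choice fields.
        if "min" in rule and value < rule["min"]:
            problems.append(f"{field}: {value} is below {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{field}: {value} is above {rule['max']}")
        if "choices" in rule and value not in rule["choices"]:
            problems.append(f"{field}: '{value}' is not an allowed choice")
    # Fields the template does not define are flagged rather than silently kept.
    for field in record:
        if field not in template:
            problems.append(f"{field}: not in template")
    return problems
```

A coder who enters an age of 150 or types "nrth" for the region would be prompted right away, which is far cheaper than tracking the error down later.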
What Does Data Cleaning Involve?
Data cleaning includes auditing, validating, and correcting values against a known list of entities for data integrity. This ensures complete, consistent data. Data cleaning is an iterative process. You’ll start with an auditing workflow, mainly using database and statistical tools, specifying parameters, and eventually generating code to identify data anomalies.
(If you want to go deeper into the tools and steps involved, check out Theodore Johnson’s presentation on Data Quality and Data Cleaning.)
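As a rough illustration of an audit pass, the sketch below checks one column against a known list of valid values and suggests a likely correction where it can. The list of regions and the function name are assumptions made for this example, and the fuzzy matching relies on Python's standard `difflib` module rather than any specialized data quality tool.

```python
from difflib import get_close_matches

# Hypothetical reference list; in practice this comes from your codebook.
KNOWN_REGIONS = ["north", "south", "east", "west"]


def audit_column(values, known=KNOWN_REGIONS):
    """Return (row index, original value, suggested fix or None) for each anomaly."""
    findings = []
    for i, raw in enumerate(values):
        # Normalize whitespace and case before comparing.
        cleaned = raw.strip().lower() if isinstance(raw, str) else raw
        if cleaned in known:
            continue
        suggestion = None
        if isinstance(cleaned, str) and cleaned:
            # Propose the closest known value, if one is similar enough.
            matches = get_close_matches(cleaned, known, n=1, cutoff=0.6)
            suggestion = matches[0] if matches else None
        findings.append((i, raw, suggestion))
    return findings


# Example audit: only the anomalies are reported, nothing is changed yet.
for row, value, fix in audit_column(["North ", "nroth", "", "central"]):
    print(f"row {row}: {value!r} -> suggested: {fix}")
```

Because cleaning is iterative, you would review the suggestions, apply the ones you trust, and then run the same audit again to confirm nothing was missed.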
Here are a few key tips to keep in mind to help lighten your data cleaning load:
- There are many helpful tools that will identify inaccurate pieces in your data and automatically replace, modify, or delete them.
- If your data can’t be corrected using an automated tool, remember that you must conduct another audit after manual cleaning.
- Document your steps to avoid re-cleaning data that you’ve already corrected or found free of errors.
- It can be challenging to correct duplicate values or invalid entries since often the information you have is not enough to determine the correct value. Deleting data is the next-best option, but remember that losing too much data can be costly.
- If you’re coding open-ended responses to a survey, try copying the responses onto individual cards and grouping similar comments together to get a sense of the most common answers. (ThemeSeekr is a great software alternative that uses word clouds to highlight frequent responses.)
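Two of these tips, documenting your steps and handling duplicates carefully, can be combined in a few lines of code. The sketch below is one possible approach, not a standard recipe: it sets suspected duplicates aside instead of deleting them and writes a log entry for each decision. The function name and record fields are hypothetical.

```python
def flag_duplicates(records, key_fields):
    """Split records into kept and flagged duplicates, with a log of each decision."""
    seen = {}  # normalized key -> index of the first row with that key
    kept, flagged, log = [], [], []
    for i, rec in enumerate(records):
        # Compare on normalized key fields so "Ana" and " ana " count as the same.
        key = tuple(str(rec.get(f, "")).strip().lower() for f in key_fields)
        if key in seen:
            # Set the row aside rather than deleting it, so it can be reviewed.
            flagged.append(rec)
            log.append(f"row {i}: duplicate of row {seen[key]} on {list(key_fields)}")
        else:
            seen[key] = i
            kept.append(rec)
    return kept, flagged, log


# Made-up records for illustration.
rows = [
    {"name": "Ana", "email": "ana@example.com"},
    {"name": " ana ", "email": "ANA@example.com"},
    {"name": "Ben", "email": "ben@example.com"},
]
kept, flagged, log = flag_duplicates(rows, ["name", "email"])
print(len(kept), "kept;", len(flagged), "flagged")
for entry in log:
    print(entry)
```

Keeping the flagged rows and the log means you can reverse a decision later and you won't re-clean data you have already reviewed.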
Consult with a Data Cleaning Expert
As you can see, you’ll need to pay serious attention to detail to effectively and efficiently audit and clean your data. If you’re struggling with a data cleaning project, we’re here to help. Get in touch with our team now.