Why audit and clean data?
The entire survey process is undertaken to make informed, evidence-based decisions from the results. It isn’t a big leap to see that if the data is incorrect, the analysis will lead to inaccurate conclusions. Take a look at your data: can it be accurately interpreted? Are there typos, multiple formats, or missing or incorrect values?
Resolving data quality problems is often where the majority of survey time and resources are spent. Data cleaning draws on a number of disciplines: establishing sound process procedures, analyzing the data and the relationships within it, and finding and repairing anomalies. Each of the many types of data (descriptive, longitudinal, streaming, scraped web, text, numeric, and so on) has its own quality issues, related to how the data was gathered and how it will be used.
How can the costs and resources for data cleaning be minimized?
As I’ve pointed out throughout this survey series, planning ahead, from the research question through data collection, will really pay off at the data cleaning stage of the process. If the field team has a template, the data can go through an initial cleaning concurrent with data entry, by prompting coders as they enter the data. This helps to avoid inaccuracy, inability to account for what is important, incomplete data that cannot be interpreted, and vague results that provide no clear guidance.
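To make the idea of cleaning concurrent with data entry concrete, here is a minimal Python sketch of a template check that returns prompts for the coder. The field names, ranges, and allowed values in `TEMPLATE` are hypothetical; a real template would come from your own survey instrument.

```python
# Hypothetical entry template: each field maps to the rule its values must satisfy.
TEMPLATE = {
    "respondent_id": {"required": True, "type": int},
    "age": {"required": True, "type": int, "min": 18, "max": 110},
    "region": {"required": True, "type": str,
               "allowed": {"north", "south", "east", "west"}},
    "comments": {"required": False, "type": str},
}


def check_entry(record, template=TEMPLATE):
    """Return a list of prompts for the coder; an empty list means the record passes."""
    prompts = []
    for field, rule in template.items():
        value = record.get(field)
        # Missing or blank values only matter for required fields.
        if value is None or value == "":
            if rule["required"]:
                prompts.append(f"{field}: value is required")
            continue
        # Wrong type: skip the remaining checks for this field.
        if not isinstance(value, rule["type"]):
            prompts.append(f"{field}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            prompts.append(f"{field}: {value} is below the minimum of {rule['min']}")
        if "max" in rule and value > rule["max"]:
            prompts.append(f"{field}: {value} is above the maximum of {rule['max']}")
        if "allowed" in rule and value not in rule["allowed"]:
            prompts.append(f"{field}: '{value}' is not an accepted value")
    return prompts


# Example: an out-of-range age and a misspelled region both trigger prompts.
for prompt in check_entry({"respondent_id": 2, "age": 17, "region": "nrth"}):
    print(prompt)
```

Returning prompts rather than rejecting the record outright leaves the decision with the coder, who can correct the entry while the source form is still in hand.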
The University of Wisconsin’s Survey Guide states, “A consistent process for organizing and analyzing survey data should be established and clearly documented well ahead of receiving the first responses, with everyone involved receiving ample training.” This early preparation, before the survey is undertaken, can significantly reduce the need for extensive data cleaning at the end.
What is involved in data auditing and cleaning?
Data cleaning involves auditing, validating, and correcting values against a known list of entities, so that the data is complete, uniform, and consistent. It’s an iterative process that starts with an auditing workflow: using database and statistical tools, you specify parameters and then generate code to identify data anomalies. (For a more detailed look at common tools and steps involved in the process, Theodore Johnson, AT&T Labs Research, has several articles and presentations on Data Quality and Data Cleaning.)
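As a rough illustration of an audit pass, the Python sketch below flags missing values, values that are not in a reference list, and duplicate IDs, without changing the data. The column names and the `KNOWN_REGIONS` list are assumptions made for the example, not part of any particular tool.

```python
from collections import Counter

# Hypothetical reference list of valid entities for the "region" column.
KNOWN_REGIONS = {"north", "south", "east", "west"}


def audit(rows):
    """Return (row_index, field, problem) tuples; the rows themselves are not modified."""
    findings = []
    id_counts = Counter(row.get("respondent_id") for row in rows)
    for i, row in enumerate(rows):
        # Flag empty or whitespace-only values in any field.
        for field, value in row.items():
            if value is None or str(value).strip() == "":
                findings.append((i, field, "missing"))
        # Validate against the reference list, ignoring case and stray spaces.
        region = row.get("region")
        if region and str(region).strip().lower() not in KNOWN_REGIONS:
            findings.append((i, "region", "not in reference list"))
        # Flag every row whose ID appears more than once.
        if id_counts[row.get("respondent_id")] > 1:
            findings.append((i, "respondent_id", "duplicate"))
    return findings


rows = [
    {"respondent_id": "001", "region": "North", "age": "34"},
    {"respondent_id": "002", "region": "Nrth", "age": ""},
    {"respondent_id": "002", "region": "south", "age": "51"},
]
for finding in audit(rows):
    print(finding)
```

Because the process is iterative, the same `audit` function can be run again after each round of corrections until it returns an empty list.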
- There are helpful tools for identifying inaccurate parts of the data and then automatically replacing, modifying, or deleting them. Where data cannot be corrected this way, the next step is to correct it manually, which requires another audit afterwards.
- To avoid re-cleaning data that has already been corrected and found free of errors, keep a cleansing lineage: a record of which values were checked or changed, and when.
- Correcting duplicate or invalid entries is difficult, since the information available is often not enough to determine the correct value. Deleting the entry is the next best solution, but the data is then lost, which can be costly.
- Regarding open-ended response coding, the University of Wisconsin’s Survey Guide describes the process:
…usually involves examining some preliminary data to identify potential categories, and then testing to determine how consistently the categories are assigned by various coders. To analyze responses to open-ended questions, you can copy the comments onto individual cards and then group similar comments together. This will give you a sense of the most frequent ideas. Alternatively, there are software packages that help in analyzing responses to open-ended questions. ThemeSeekr was developed to aid in processing the thousands of comments received during the University of Wisconsin’s 2009 Reaccreditation self-study, and uses “word clouds” as a visual analysis tool to show the relative frequency of the themes into which responses have been categorized.
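The guide does not say how coder consistency should be measured. One common choice, used here as an assumption, is Cohen's kappa for two coders, which compares observed agreement with the agreement expected by chance. The Python sketch below computes it along with simple theme frequencies; the category labels are invented for the example.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who each assigned one category per response."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must code the same non-empty set of responses")
    n = len(codes_a)
    # Share of responses where the two coders chose the same category.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each coder's category frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical category assignments for six open-ended comments.
coder_a = ["cost", "cost", "access", "quality", "access", "cost"]
coder_b = ["cost", "access", "access", "quality", "access", "cost"]

print(round(cohens_kappa(coder_a, coder_b), 3))
# Theme frequencies, the kind of counts a word cloud would visualize.
print(Counter(coder_a).most_common())
```

A kappa near 1 suggests the categories are being applied consistently, while a value near 0 suggests agreement is no better than chance and the category definitions may need revisiting.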
As you can see, data auditing and cleaning requires great attention to detail in order to manage costs, time, and results effectively. If you have questions or issues as you complete your survey process, you can look through the other articles in this survey series, or send a message to @Datassist on Twitter for a more specific response.