Last week we talked about using Tabula to free data from PDFs and get it into a workable state. Sometimes when extracting data using tools such as Adobe Acrobat, Tabula or simply cutting and pasting from documents and websites, you get the data, but it’s not quite in the format you need to work with. Often it comes in a format that isn’t quite recognized by Excel as a well-formatted spreadsheet, because Excel may condense several columns of data into a single column of text.
Excel includes an option called “Text to Columns” which is one of the secret weapons of data productivity in Excel. It’s flexible, easy to use and has saved me hundreds of hours in my years of working with data. For this demonstration, we’ll use the data we extracted from the PDF last week. Only, instead of choosing the “Download CSV” option in Tabula, we’ll choose the “Copy to Clipboard” option. (However, any copied data will work.)
The “Text to Columns” command puts each record in a separate column, dividing the data into columns, as well as getting rid of unwanted delimiters that often come with data that’s cut and pasted from the internet.
Step 1: Paste the unformatted data into an Excel workbook
Step 2: Select the column of unformatted data and choose “Text to Columns” from the Data menu.
You can choose from the preset delimiters such as Tab, Semicolon, Comma, and you can also enter any special character or letter that your data contains in the “Other” box. This trick is quite handy if you have nuisance material in your copied data, such as extra dashes or colons. Simply enter them in the “Other” delimiter box and they will be separated from your data by the tool.
If this is how you want your data to appear when it is separated into columns, proceed. If it’s not, play a bit with the selection of delimiters in the menu above. Most times, with a little experimenting, you can get your data to be reformatted exactly how you’d like it.
Your unusable data is now usable!
Datassist has available a number of “How to’s”, resources, and articles sharing what we find useful in the field of data science for data journalism organizations, non-profits, and business organizations. As you have time, you might look through previous Datassist blogs, such as our data series and the blogs about data-driven journalism last year, including Smart Data-Driven Journalism Tools.[dismissable_alert class=”alert-info” ]…and take time right now to: