Have you heard about Google’s new Dataset Search feature? If you’re into data the way we are around here, you’ve probably already seen it, tried it, and noted some of the issues with it. But if you’re not, it’s worth understanding the pros and cons.
What is Dataset Search?
For starters, it’s probably worthwhile to make sure you understand what Google’s new tool even does. Dataset Search does… pretty much what the name says. The tool was designed to make public data more accessible to, well, everyone.
There are literally thousands (if not more) of data repositories online. But if you don’t know where they are, you can’t really take advantage of them. Natasha Noy, a Google research scientist, says the feature was launched “so that scientists, data journalists, data geeks, or anyone else can find the data required for their work and their stories, or simply to satisfy their intellectual curiosity.”
Sounds great, right?
How Does It Work?
First off, let’s take a look at some examples of the kind of information that qualifies as a dataset for the purposes of the Dataset Search tool:
- A table or a CSV file with some data
- An organized collection of tables
- A file in a proprietary format that contains data
- A collection of files that together constitute some meaningful dataset
- A structured object with data in some other format that you might want to load into a special tool for processing
- Images capturing data
- Files relating to machine learning, such as trained parameters or neural network structure definitions
Dataset Search works by sifting through the tags (metadata) that data owners apply to their datasets, so data tagged with words relevant to your search terms will appear in your results. Institutions that publish their datasets online describe them using the schema.org vocabulary, and the tool crawls through those tags to serve up the results you need. (Here are some of the technical details.)
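To make the tagging step concrete, here is a minimal Python sketch of the kind of schema.org description a publisher might attach to a dataset as JSON-LD. The dataset name, description, keywords, and URL are invented for illustration, and the helper function `build_dataset_markup` is hypothetical. Only the type and property names (`Dataset`, `name`, `description`, `keywords`, `distribution`, `DataDownload`, `encodingFormat`, `contentUrl`) come from the schema.org vocabulary.

```python
import json


def build_dataset_markup(name, description, keywords, download_url):
    """Return a schema.org Dataset description as a JSON-LD string."""
    markup = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "keywords": keywords,
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": "text/csv",
                "contentUrl": download_url,
            }
        ],
    }
    return json.dumps(markup, indent=2)


# Hypothetical dataset, for illustration only.
jsonld = build_dataset_markup(
    name="City Bike Counts 2018",
    description="Daily bicycle counts at ten hypothetical intersections.",
    keywords=["cycling", "traffic", "transportation"],
    download_url="https://example.org/data/bike-counts-2018.csv",
)
print(jsonld)
```

A publisher would typically embed the resulting JSON in the dataset’s web page inside a `<script type="application/ld+json">` element, where a crawler can read it.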
So What’s the Downside?
There are a couple of issues with Google’s Dataset Search — one that Google acknowledges and one they don’t really mention.
- Dataset Search doesn’t search for words within the datasets — just tags.
That means, to quote Google’s own blog post about the issue, “A search tool like this one is only as good as the metadata that data publishers are willing to provide.” It’s not a serious problem. But it is a limitation. If someone has published the exact data you’re looking for but not tagged it, you’ll never find it using this tool.
Many funding agencies today specify that data from projects they finance must be made publicly available online. But information can exist online without actually being findable, which makes those well-meaning stipulations rather meaningless.
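A toy example shows why this first limitation matters. The sketch below is not Google’s algorithm. It is a deliberately simple stand-in that, like any metadata-only search, looks at titles and keywords and never opens the data itself. The catalog entries and the function `search_metadata` are hypothetical.

```python
def search_metadata(query, catalog):
    """Return titles whose title or keywords contain the query (case-insensitive).

    The "contents" field is intentionally ignored, mimicking a search
    that only sees metadata.
    """
    q = query.lower()
    hits = []
    for entry in catalog:
        searchable = [entry["title"]] + entry["keywords"]
        if any(q in field.lower() for field in searchable):
            hits.append(entry["title"])
    return hits


# Two hypothetical datasets that both contain rainfall measurements.
catalog = [
    {
        "title": "Monthly rainfall by region",
        "keywords": ["rainfall", "climate"],
        "contents": "region,month,rainfall_mm\nnorth,1,82",
    },
    {
        # Published without tags: the word only appears inside the data.
        "title": "Station readings 2017",
        "keywords": [],
        "contents": "station,date,rainfall_mm\nA1,2017-01-01,3",
    },
]

print(search_metadata("rainfall", catalog))
# prints ['Monthly rainfall by region']
```

Both datasets hold the data you want, but only the tagged one comes back. The untagged one is online and effectively invisible.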
- Like most Google tools, Dataset Search relies on a closed black-box system.
What’s the big deal about that?
There is no public API, which means there is no way to know what is, and isn’t, being indexed. The Google team has developed an algorithm to rank datasets in the search results, which makes total sense. But there are issues when we give too much power to algorithms we can’t see.
I’ve talked about Cathy O’Neil’s Weapons of Math Destruction before, but this bears repeating. There is great risk in assuming statistics, algorithms and other mathematical models are free from bias simply because we don’t see humans manipulating them. Google has gone to great lengths to keep its algorithms proprietary in the name of protecting its business model. That sounds fair enough until you start relying on those algorithms to decide what matters and what doesn’t.
Should You Use Dataset Search?
New tools are always nice to have. And it doesn’t take much to sell me on the benefits of anything that makes data sharing easier for everyone. The important thing, as with many new technologies, is to understand the limitations.
Need help accessing public data for your project? Want to know more about why algorithms need to be treated with caution when it comes to data? Talk to the experts at Datassist today.