Author: William Vorhies
Summary: There are several approaches to reducing the cost of training data for AI, one of which is to get it for free. Here are some excellent sources.
Recently we wrote that training data (not just data in general) is the new oil. It’s the difficulty and expense of acquiring labeled training data that causes many deep learning projects to be abandoned.
It also matters a great deal just how good you want your new deep learning app to be. In their 2016 book Deep Learning, Goodfellow, Bengio and Courville estimated that you could get ‘acceptable’ performance with about 5,000 labeled examples per category, BUT that it would take a dataset of at least 10 million labeled examples to “match or exceed human performance”.
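To make the 5,000-per-category figure concrete, here is a minimal sketch of the arithmetic for checking how far a labeled set falls short of it. The category names, counts, and function name are hypothetical and only for illustration.

```python
# Rule-of-thumb target quoted in the text above.
ACCEPTABLE_PER_CATEGORY = 5_000


def labels_still_needed(labeled_counts, target_per_category=ACCEPTABLE_PER_CATEGORY):
    """Return how many more labeled examples each category needs to hit the target."""
    return {
        category: max(0, target_per_category - count)
        for category, count in labeled_counts.items()
    }


# Hypothetical counts for a three-category image classifier.
current_counts = {"cat": 1_200, "dog": 5_400, "bird": 300}

shortfall = labels_still_needed(current_counts)
total_missing = sum(shortfall.values())

print(shortfall)
print("Total labels still to acquire:", total_missing)
```

A shortfall like this is the gap that borrowed public data or cheaper labeling would need to close.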
There are a number of technologies coming up through research now that promise more accurate auto labeling to make creating training data less costly and time-consuming. Snorkel from the Stanford DAWN Project is one we covered recently. This area is getting a lot of research attention.
Another approach is to build on someone else’s work using publicly available datasets. You can begin by building your model on the borrowed set, blend your data with the borrowed data, or use transfer learning to repurpose the front end of an existing model and train on your more limited data.
Whatever your strategy, the ability to build on publicly available datasets is always something you’ll want to consider, so your ability to find them becomes key.
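The transfer learning option mentioned above can be sketched in a few lines. This is a dependency-free toy, not a real pretrained network: the frozen "front end" is a hypothetical fixed set of weights standing in for borrowed layers, and only a small logistic-regression head is trained on a handful of labeled examples. All names and numbers here are assumptions made for illustration.

```python
import math

# Hypothetical stand-in for the frozen front end of a borrowed model.
# In practice these would be pretrained layers; here they are fixed numbers.
FROZEN_WEIGHTS = [[0.9, -0.4], [-0.3, 0.8], [0.5, 0.5]]


def extract_features(x):
    """Frozen feature extractor: its weights are never updated."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]


def train_head(examples, epochs=200, lr=0.5):
    """Fit a logistic-regression head on top of the frozen features."""
    weights = [0.0] * len(FROZEN_WEIGHTS)
    bias = 0.0
    for _ in range(epochs):
        for x, label in examples:
            feats = extract_features(x)
            z = sum(w * f for w, f in zip(weights, feats)) + bias
            pred = 1.0 / (1.0 + math.exp(-z))
            error = pred - label
            # Only the head's parameters change; FROZEN_WEIGHTS is untouched.
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias


def predict(head, x):
    """Classify x as 1 or 0 using the frozen features and the trained head."""
    weights, bias = head
    feats = extract_features(x)
    z = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1 if z > 0 else 0


# A deliberately tiny labeled set, standing in for "your more limited data".
small_labeled_set = [
    ([1.0, 0.0], 1),
    ([0.9, 0.2], 1),
    ([0.0, 1.0], 0),
    ([0.1, 0.8], 0),
]

head = train_head(small_labeled_set)
print([predict(head, x) for x, _ in small_labeled_set])
```

With a real framework the shape is the same: load a pretrained model, freeze its early layers, and train only a new final layer on your own labels.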
Here are some notes on where you might start your search. These won’t all be labeled image and text but a lot of them are. And for those of you looking to use ML and statistical techniques, there’s plenty here for you too.
Google
Wouldn’t it be delightful to just Google the type and subject of datasets we want? You may already have your favorites, for example NOAA and NASA for weather. But until early September, Google search didn’t include metadata search for datasets. Now, thanks to Google’s support for Schema.org dataset markup, the metadata for datasets is recognized by Google’s knowledge graph. This is in beta. You can find it here:
https://toolbox.google.com/datasetsearch
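If you publish data yourself, dataset metadata of this kind is typically expressed as Schema.org `Dataset` markup embedded in the page that hosts the data. The sketch below builds such a record as JSON-LD. The property names are standard Schema.org terms, but the dataset, URLs, and helper function are hypothetical, and which fields a given search engine requires is not covered here.

```python
import json


def dataset_jsonld(name, description, url, license_url, download_url, encoding_format):
    """Build a Schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": download_url,
            }
        ],
    }


# Hypothetical dataset; every value below is a placeholder.
markup = dataset_jsonld(
    name="Labeled street sign images",
    description="Images of street signs with one category label per image.",
    url="https://example.org/datasets/street-signs",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    download_url="https://example.org/datasets/street-signs.zip",
    encoding_format="application/zip",
)

# JSON-LD is usually embedded in the hosting page inside a script tag.
print('<script type="application/ld+json">')
print(json.dumps(markup, indent=2))
print("</script>")
```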
Google staff say they’ve already indexed more than a million items that appear to be datasets, but there’s a way to go before the index is clean. Some refinements are already available. Here’s a subsidiary search site just for truly public datasets.
https://cloud.google.com/public-datasets/
This page will also lead you to some special subsets like:
- Google BigQuery Public Datasets (the first terabyte of data queried each month is free, but charges apply after that).
- Google Genomics Public Datasets
- Geo Imagery Datasets
There is too much interesting stuff here to list, but it includes GitHub data, Medicare data, public IRS forms data, and about 9 million URLs to labeled images spanning 6,000 categories.
Microsoft
Not to be outdone, Microsoft recently launched a similar site called Microsoft Research Open Data, also in beta.
MS Research Open Data doesn’t search the entire web, but rather makes available 53 previously proprietary datasets all in the realm of deep learning, both text/speech and image.
Academic Torrents
This smaller not-for-profit offers just under 2,000 datasets totaling about 28 terabytes. It is a distributed system for sharing very large datasets covering a very eclectic range of topics. It is searchable, though perhaps not as comprehensively as the Google site. In addition to downloading, you might want to consider uploading your own dataset to this site for others to use.
Skymind
Skymind is a commercial platform to rapidly prototype, deploy, maintain, and retrain machine learning models. They offer 101 datasets from a variety of sources that cover Natural-Image, Geospatial, Facial, Video, Text, Question answering, Sentiment, Recommendation and ranking systems, Networks and Graphs, Speech Datasets, Symbolic Music, Health & Biology, and Government & statistical data sets.
https://skymind.ai/wiki/open-datasets
GitHub / Kaggle / Federal Government Sources
We should never forget our tried and true traditional sources:
GitHub: 565 datasets.
https://github.com/awesomedata/awesome-public-datasets
Kaggle Public Datasets: 10,992 current listings.
https://www.kaggle.com/datasets
Data.Gov. The home of the US Government’s open data. Currently 302,944 datasets.
Figure Eight
This commercial provider of human-in-the-loop data currently offers only eight datasets. The reason for their inclusion here is unique. Figure Eight makes its reputation by providing accurate data, especially by enhancing the accuracy of its clients’ data.
What we have not discussed above is the issue of accuracy, and you should rightly be concerned before you accept a public dataset as the foundation for your new AI application.
https://www.figure-eight.com/datasets/
The other interesting aspect of Figure Eight is their promotion of active learning techniques. The phrase ‘Active Learning’ can be a bit misleading. It’s actually a method for incrementally improving training data quality without committing all the training data to human review, approaching a statistically supportable point of ‘best accuracy’ balanced against lower cost.
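One common way to decide which items go to human review is uncertainty sampling: send reviewers only the items the current model is least sure about, retrain, and repeat. The sketch below shows a single selection step for a binary classifier. It is a generic illustration, not a description of Figure Eight’s product, and the item ids, scores, and function name are hypothetical.

```python
def select_for_review(predicted_probabilities, budget):
    """Pick the `budget` items whose predicted probability is closest to 0.5.

    predicted_probabilities maps an item id to the model's probability
    that the item belongs to the positive class.
    """
    ranked = sorted(
        predicted_probabilities,
        key=lambda item: abs(predicted_probabilities[item] - 0.5),
    )
    return ranked[:budget]


# Hypothetical model scores for five unlabeled items.
model_scores = {"a": 0.97, "b": 0.52, "c": 0.10, "d": 0.45, "e": 0.81}

# Only two items are sent to human reviewers in this round.
to_review = select_for_review(model_scores, budget=2)
print(to_review)
```

Items the model already scores confidently, such as "a" and "c" here, are left alone, which is where the cost saving comes from.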
Incidentally, Figure Eight makes a convincing case that, especially in the area of NLP where you might be training chatbots, it’s worth the investment to have multiple reviewers for each item, drawn from different demographics, in order to avoid cultural bias in the interpretation. You know, is that a hoagie, sub, hero, po-boy, grinder, torpedo, etc.? If active learning is of interest, there’s a good DSC webinar with Figure Eight’s leading expert in the field.
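When several reviewers label each item, their answers have to be combined somehow. A simple majority vote with an agreement score is one option, sketched below. This is an assumed approach for illustration, not one described by Figure Eight, and the items, votes, and 0.6 threshold are hypothetical.

```python
from collections import Counter


def aggregate_labels(votes):
    """Return the most common label and the fraction of reviewers who chose it."""
    counts = Counter(votes)
    label, top_count = counts.most_common(1)[0]
    return label, top_count / len(votes)


# Hypothetical votes from three reviewers per item.
item_votes = {
    "photo_1": ["sub", "sub", "hoagie"],
    "photo_2": ["hero", "grinder", "sub"],
}

AGREEMENT_THRESHOLD = 0.6  # assumed cutoff for "reviewers mostly agree"

needs_more_review = []
for item, votes in item_votes.items():
    label, agreement = aggregate_labels(votes)
    print(item, label, round(agreement, 2))
    if agreement < AGREEMENT_THRESHOLD:
        # Low agreement may signal regional or cultural differences in naming.
        needs_more_review.append(item)

print("Send back for more review:", needs_more_review)
```

Low agreement is useful information in its own right: it marks the items where a single reviewer’s label would have been most likely to carry bias.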
About the author: Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001. He can be reached at:
Bill@Data-Magnum.com or Bill@DataScienceCentral.com