The Data Scientist’s Holy Grail – Labeled Data Sets

The Holy Grail for data scientists is the ability to obtain labeled data sets for training a supervised machine learning algorithm. An algorithm’s ability to “learn” depends on training it with a labeled training set: one in which each set of predictor variable values comes with a known response variable value.

There are a number of common, and some not-so-common, methods for labeling a data set. In this article, we’ll run down a short list of such methods so you can choose the one best suited to your circumstances.


Readily Available Labeled Data Sets

Sometimes, labeled data sets are readily available as a byproduct of ongoing business operations. For example, if a company is trying to predict customer churn (a very common classification problem), its data assets will likely already contain the label values, “churned” or “not-churned,” in each customer’s account history. The company knows when a customer canceled their account, because the cancellation generated a churn transaction.
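As a rough illustration of how such labels fall out of existing records, here is a minimal Python sketch. The record layout (`customer_id`, `opened`, `canceled`), the sample data, and the `label_churn` helper are all hypothetical, made up for this example; a real account history will look different.

```python
from datetime import date

# Hypothetical account records; the field names and values are illustrative only.
accounts = [
    {"customer_id": 1, "opened": date(2018, 1, 5), "canceled": date(2018, 9, 1)},
    {"customer_id": 2, "opened": date(2018, 3, 12), "canceled": None},
    {"customer_id": 3, "opened": date(2018, 6, 30), "canceled": date(2019, 2, 14)},
]

def label_churn(records, as_of):
    """Attach a 'churned' / 'not-churned' label based on the cancellation date.

    A customer counts as churned only if the cancellation happened on or
    before the `as_of` cutoff date.
    """
    labeled = []
    for rec in records:
        canceled = rec["canceled"]
        churned = canceled is not None and canceled <= as_of
        labeled.append({**rec, "label": "churned" if churned else "not-churned"})
    return labeled

labeled_accounts = label_churn(accounts, as_of=date(2018, 12, 31))
for rec in labeled_accounts:
    print(rec["customer_id"], rec["label"])
```

The cutoff date matters: a customer who cancels after the cutoff is still “not-churned” as far as a training set built on that date is concerned.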

At other times, the label is not readily available and must be acquired or derived. For example, in a real estate application designed to predict the monthly rental value of a residential apartment building, the desired label may only come from a laborious process in which domain experts determine the value based on their industry knowledge. Finding label values this way can be time-consuming and labor-intensive, especially if the project needs a large amount of labeled data.

Crowdsourced Labeling Services

Another way to obtain a labeled data set is to use a resource like Amazon Mechanical Turk (MTurk). MTurk is a clearinghouse for Human Intelligence Tasks, i.e., tasks best done by humans equipped with the most powerful computer of all – the brain. Data scientists frequently use it to collect classifications from Mechanical Turk workers, who earn a small payment for each classification completed. This arrangement taps the power of the human brain for making classifications, which makes MTurk a useful resource for generating labeled data sets for machine learning applications.
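When several workers label the same item, their answers usually need to be reconciled into a single label. The sketch below shows one common approach, a simple majority vote with a minimum-agreement threshold. It does not call the MTurk API; the `majority_label` helper, the `min_agreement` value, and the sample votes are assumptions made up for illustration.

```python
from collections import Counter

def majority_label(votes, min_agreement=0.6):
    """Return the most common label, or None if agreement is too low.

    `min_agreement` is the fraction of votes the winning label must
    receive; 0.6 is an arbitrary illustrative choice.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None

# Hypothetical worker responses, keyed by item ID.
worker_votes = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "dog", "bird"],
}

consensus = {item: majority_label(votes) for item, votes in worker_votes.items()}
print(consensus)
```

Items that come back as `None` have no clear consensus and could be sent out for additional labels or reviewed by hand.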

In the scientific realm, astronomers have been using crowdsourced methods for several years to classify the types of distant galaxies with Galaxy Zoo (now part of Zooniverse for many other types of classification problems).

Training Set Generating Products

There are also companies that offer technology for generating training data sets. Take Figure Eight for example. Their Human-in-the-Loop machine learning platform transforms unstructured text, image, audio, and video data into customized, high-quality training data.

The Figure Eight technology platform uses machine learning-assisted annotation to create the training data needed by statistical learning models. The company supports a wide range of computer vision and natural language processing use cases across a broad range of industries. According to the company, the platform has generated over 10 billion training data labels to power real-world AI applications.
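To make the human-in-the-loop idea concrete, here is a minimal sketch of one generic pattern: accept a model’s label when its confidence is high, and route everything else to a human annotator. This is not Figure Eight’s implementation or API; the `route_predictions` function, the 0.9 threshold, and the sample predictions are all assumptions for illustration.

```python
def route_predictions(predictions, threshold=0.9):
    """Split model predictions into auto-accepted labels and items for human review.

    `predictions` is a list of (item_id, label, confidence) tuples.
    The 0.9 default threshold is an arbitrary illustrative value.
    """
    auto_labeled, needs_review = [], []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            auto_labeled.append((item_id, label))
        else:
            needs_review.append(item_id)
    return auto_labeled, needs_review

# Hypothetical model output: (item ID, predicted label, confidence).
model_output = [
    ("doc_1", "invoice", 0.97),
    ("doc_2", "receipt", 0.62),
    ("doc_3", "invoice", 0.91),
]

auto_labeled, needs_review = route_predictions(model_output)
print("auto:", auto_labeled)
print("review:", needs_review)
```

In a setup like this, the human-reviewed items would typically be fed back into training so the model needs less help over time.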

The Chinese Business Model

China, with its more affordable and accessible resources, takes yet another approach to labeling data sets. A number of Chinese data labeling factories are popping up, like the one in Jiaxian, a city in the central Henan province. The firm likens itself to the assembly lines of 10 years ago. This and other like-minded Chinese start-ups realize that AI has to be taught, and that in order to learn, the technology must digest vast amounts of labeled photos and videos before it can recognize that a Great Dane and a Chihuahua are both dogs. This is where the data factories and their workers come in: teams of people go through photos and videos, labeling just about everything they see, including cars, pedestrians, stop signs, and bicycle riders.

Labeling data sets may be China’s biggest AI strength, one that the U.S. may not be able to match. The Chinese government and companies enjoy access to mountains of data, thanks to weak privacy laws and enforcement.

Data scientists implicitly understand that an untrained statistical model is virtually useless. Without high-quality labeled training data, supervised learning can’t do what it does best: learn. Only with properly labeled data can we ensure that models are able to predict, classify, or otherwise analyze real-world data to provide accurate business insights. Fortunately, we have a number of methods for obtaining labeled data at our disposal; it’s just a matter of choosing the one that best matches your needs.

Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who has been working with data since long before the field came into vogue. As a technology journalist, he enjoys keeping his finger on the pulse of this fast-paced industry. Daniel is also an educator, having taught data science, machine learning, and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.