Trial, Error, Triumph: Lessons Learned Using LLMs for Creating Machine Learning Training Data

The broad availability and performance of large language models (LLMs) enable practitioners to automate a variety of time-consuming tasks. Obtaining a large number of quality labels for a machine learning training dataset is a critical step in supervised learning, but generating those labels manually can take a prohibitive amount of time. At this year’s ODSC East, Matt Dzugan outlined an approach that his team at Muck Rack employs to generate high-quality machine learning training datasets using LLMs.

While many natural language processing (NLP) tasks can be solved with LLMs, they don’t offer the most cost-effective or accurate predictions in every application. To illustrate how his team employs LLMs efficiently, Matt worked through an example task of assigning relevant topics to a large volume of articles. Using an LLM to generate a topic for each article in production would be prohibitively costly at scale, where millions of articles are processed per day. Instead, one could train a more traditional NLP model on a suitable training dataset and use the trained topic classifier to score each article. Unless a training dataset is already available, it is necessary to create one.
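To make the scale argument concrete, here is a minimal back-of-the-envelope sketch. The function name, its parameters, and every number in the example call are hypothetical placeholders, not figures from Matt's talk or any provider's price list. It only counts input tokens, so it likely understates the real bill.

```python
def daily_llm_cost(articles_per_day: int,
                   tokens_per_article: int,
                   usd_per_million_tokens: float) -> float:
    """Rough daily cost of sending every production article to an LLM once.

    Only input tokens are counted; output tokens and retries are ignored.
    """
    total_tokens = articles_per_day * tokens_per_article
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder inputs for illustration only (NOT from the talk).
cost = daily_llm_cost(articles_per_day=2_000_000,
                      tokens_per_article=1_000,
                      usd_per_million_tokens=1.0)
print(f"Estimated LLM cost: ${cost:,.0f} per day")
```

Because the cost grows linearly with article volume, a small classifier that is trained once and then scores articles at low cost becomes attractive as volume rises.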

Not all machine learning training datasets are equally useful; Matt illustrated that the best ones are easy to obtain, are accurate, and generalize well to the data in production. The distinction matters because a given data generation method may deliver some of these qualities far better than others.

Figure 1: Three key qualities of effective training data.

Matt described four approaches to directly generate the machine learning training dataset using an LLM. The first approach was coined “The Labeler”, where an LLM is given each article and instructed to assign one topic from 1000 possible options. While “The Labeler” approach creates a dataset that generalizes well, the high context length incurs considerable cost and the model can hallucinate topics outside of the defined scope. “The Author” approach creates a dataset by starting with topics and using an LLM to generate an article to match each one. A key disadvantage of “The Author” method is a loss of generalization; the articles won’t necessarily resemble articles seen in production and will instead be closer to the content present in the LLM’s training data. The third method, “The Librarian”, involves using an LLM to write a query that retrieves articles from a database for a given topic. “The Librarian” approach can scale well but suffers from low accuracy because it is difficult to match topics using only the keywords in a query. The last LLM-based approach Matt discussed, “The Judge”, provides an article and an assigned topic to an LLM and instructs it to determine whether the topic is a good match. This method scales poorly when the search space of potential topics is large: one would need to evaluate a large number of topic-article combinations to obtain a sufficient number of appropriate matches.
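The four approaches differ mainly in what the LLM is given and what it is asked to return. The sketch below shows one hypothetical way to phrase each request. The function names and prompt wording are assumptions for illustration, not the prompts used at Muck Rack, and no particular LLM API is implied.

```python
from typing import Sequence


def labeler_prompt(article: str, topics: Sequence[str]) -> str:
    """'The Labeler': article in, topic out. The full topic list is sent on every call."""
    options = "\n".join(f"- {t}" for t in topics)
    return (
        "Assign exactly one topic from this list to the article.\n\n"
        f"Topics:\n{options}\n\nArticle:\n{article}"
    )


def author_prompt(topic: str) -> str:
    """'The Author': topic in, synthetic article out."""
    return f"Write a short news article about the topic: {topic}"


def librarian_prompt(topic: str) -> str:
    """'The Librarian': topic in, search query out (run against an article database)."""
    return f"Write a keyword search query that retrieves articles about: {topic}"


def judge_prompt(article: str, topic: str) -> str:
    """'The Judge': (article, topic) pair in, yes/no verdict out."""
    return (
        f"Answer YES or NO: is '{topic}' a good topic for this article?\n\n"
        f"Article:\n{article}"
    )


def in_scope(answer: str, topics: Sequence[str]) -> bool:
    """Reject hallucinated labels: accept only answers that appear in the topic list."""
    return answer.strip() in set(topics)
```

Note that `labeler_prompt` repeats the entire topic list in every request, which is where the high context length comes from, and that `in_scope` is one simple guard against topics outside the defined scope.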

Figure 2: Tradeoffs of the four LLM-based data generation methods, including “The Author” (top left), “The Librarian” (top right), “The Labeler” (bottom right), and “The Judge” (bottom left).

To best balance the cost and quality of the dataset, Matt outlined a fifth approach called artificial semi-supervised learning. The first step is to employ “The Author” approach of generating articles from a series of topics. The data generated won’t generalize well by itself, so the second step involves training a model on this initial dataset and using it to score existing real-world articles. “The Judge” method can then be applied to check whether each topic predicted by the trained model is a good match for its article, so that poor matches can be discarded and good matches added to the dataset. This process can be repeated many times to grow the machine learning training dataset in a semi-supervised fashion. By combining “The Author” method, an iteratively retrained supervised learning model, and “The Judge” approach, the user can maximize accuracy, generalization, and efficiency. For perspective, Matt provided a cost comparison for one example of topic classification. Using LLMs in an artificial semi-supervised fashion incurred substantially lower costs than purely LLM-based approaches like “The Labeler” and returned a high-quality dataset.
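The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation from the talk. The names `write_article`, `train`, and `judge` are hypothetical callables standing in for the LLM “Author” call, the supervised model's training routine, and the LLM “Judge” call. The `toy_*` functions are trivial stand-ins so the sketch runs without any LLM.

```python
from typing import Callable, Iterable, List, Tuple

Example = Tuple[str, str]  # (article text, topic)


def artificial_semi_supervised(
    topics: Iterable[str],
    real_articles: Iterable[str],
    write_article: Callable[[str], str],
    train: Callable[[List[Example]], Callable[[str], str]],
    judge: Callable[[str, str], bool],
    rounds: int = 3,
) -> List[Example]:
    # Step 1 ("The Author"): seed the dataset with one synthetic article per topic.
    dataset: List[Example] = [(write_article(t), t) for t in topics]
    unlabeled = list(real_articles)
    for _ in range(rounds):
        if not unlabeled:
            break
        # Step 2: retrain the cheap model and score the remaining real articles.
        predict = train(dataset)
        rejected = []
        for article in unlabeled:
            topic = predict(article)
            # Step 3 ("The Judge"): keep a pair only if the match is confirmed.
            if judge(article, topic):
                dataset.append((article, topic))
            else:
                rejected.append(article)
        unlabeled = rejected  # rejected articles get another chance next round
    return dataset


# --- Toy stand-ins (assumptions for illustration; a real system would call an LLM) ---
def toy_write(topic: str) -> str:
    return f"{topic} {topic} news"


def toy_train(dataset: List[Example]) -> Callable[[str], str]:
    def predict(article: str) -> str:
        words = set(article.lower().split())
        # Return the topic of the training example sharing the most words.
        best = max(dataset, key=lambda ex: len(words & set(ex[0].lower().split())))
        return best[1]
    return predict


def toy_judge(article: str, topic: str) -> bool:
    return topic in article.lower()


result = artificial_semi_supervised(
    ["sports", "finance"],
    ["sports final tonight", "finance markets rally", "weather is cloudy"],
    toy_write, toy_train, toy_judge,
)
```

In this sketch the Judge is called only once per article per round, on the single topic the model proposes, rather than on every topic-article combination. That is one plausible reading of why the combined approach is cheaper than using “The Judge” alone.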

Figure 3: Cost comparison of four data-generating techniques where one bill is equal to $100. 

Matt’s talk illustrated that using LLM technology effectively requires thoughtful planning. While the topic classification example sits squarely within the NLP domain, the applications of LLMs are broadening as their performance improves. By attending ODSC talks, engineers and data scientists in any domain can stay current with the types of problems being solved with more recent technology. ODSC Europe on September 5th-6th will have a wealth of similar content for key topics like LLMs, Gen AI, and AI for finance. Check out the confirmed speakers here: https://odsc.com/europe/

Nathaniel Jermain

Nathaniel is a senior data scientist in the marketing industry, located in Saint Petersburg, FL. His work focuses on machine learning and statistical analysis, with a particular interest in causal inference. Feel free to connect with Nathaniel on LinkedIn: https://www.linkedin.com/in/njermain/