For data scientists, ground truth is the holy grail. If we think of AI as software that is taught with examples, instead of instructions, then selecting the right examples is critical to building a system that performs well.
Ground truth is the data of record that reflects verified examples of the correct outcome. It assumes:
- There is only one record for a given example.
- We have some definition or rubric that is universally applied. For example, when an image is labeled as a cat, we have clear guidelines on how to handle borderline cases like ‘tigers’ or cartoon drawings of cats.
- We have some form of assurance that it is accurate. For example, there are no typos or measurement errors, and the method by which it was generated is sound.
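The first two assumptions are concrete enough to check mechanically. Here is a minimal sketch of what that check could look like for a labeled dataset; the field names (`example_id`, `label`) and the `ALLOWED_LABELS` rubric are hypothetical, stand-ins for whatever schema and labeling guidelines your team actually uses.

```python
# Sketch: checking a labeled dataset against the first two ground-truth
# assumptions. Field names and the rubric are illustrative, not prescriptive.
from collections import Counter

ALLOWED_LABELS = {"cat", "not_cat"}  # the agreed-upon rubric

def validate_ground_truth(records):
    """Return a list of problems found in a list of {example_id, label} dicts."""
    problems = []
    # Assumption 1: only one record per example
    counts = Counter(r["example_id"] for r in records)
    problems += [f"duplicate example: {ex}" for ex, n in counts.items() if n > 1]
    # Assumption 2: every label comes from the shared rubric
    problems += [
        f"label outside rubric: {r['label']!r} ({r['example_id']})"
        for r in records
        if r["label"] not in ALLOWED_LABELS
    ]
    return problems

records = [
    {"example_id": "img-001", "label": "cat"},
    {"example_id": "img-001", "label": "not_cat"},  # conflicting duplicate
    {"example_id": "img-002", "label": "tiger"},    # borderline case, not in rubric
]
print(validate_ground_truth(records))
```

The third assumption, that the generation method itself is sound, is the one no script can verify for you; it requires the human review and rubric work discussed below.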
Ground truth is already complicated to determine with spreadsheet (tabular) data, but it gets even more challenging as your data and goals become more subjective.
As we design more complex AI models, ‘correctness’ starts to become more subjective. For example, if I were to ask you to summarize this article in three sentences, I’d likely get many different, equally correct responses—and possibly some bad ones too. Because of this, we also have a much harder time building trust between stakeholders and the models. Continue reading as I unpack this challenge by specifically focusing on ground truth and what stakeholders need to know to be effective partners with data science teams.
The Challenge of Complexity and Ground Truth
For data scientists, ground truth is the baseline against which we measure a model's performance.
With relatively simple goals, like predicting whether a patient will be readmitted within 30 days, we can observe what actually happens—in this case, 30 days later. However, as goals get more complex, like recommending among a set of items or summarizing clinical notes, defining ground truth becomes very subjective, since many equally correct answers could be observed.
The graph below depicts the relationship between data complexity, goal complexity, and ground truth. On one axis, we have types of data, including spreadsheets, documents, photos, audio, and video, and on the other, we have common AI goals, including measuring, predicting, recommending, and creating. As data becomes more complex, it becomes harder to query.
And as models go beyond the orange arch, the likelihood of risk increases, and determining ground truth becomes more complicated. This is further amplified when scaled (the size of the dataset, and the number of predictions being made).
We’re seeing a number of generative AI models fall into this category.
In one example, an Asian-American MIT student asked AI to make her headshot more professional. It generated an image that was nearly the same as her original selfie, but with lighter skin and blue eyes—features that made her look Caucasian.
What was that model’s baseline when deciding between professional and non-professional appearances? Is it correct? Is it representative of the world we live in? Is it representative of the world we want to live in?
These are all questions we are facing, more regularly with generative AI, as we determine ground truth in the ML models we’re designing. And when ground truth becomes more subjective, it becomes difficult to detect unexpected outputs—ultimately leading to less trust in the model.
>> Related Resource: How To Create Trust Between AI Builders and AI Users
What To Do When Data and Goals Increase in Complexity
Understanding the levels of data and goal complexity, and how both impact ground truth, is helpful. But what do we do when we find ourselves with models that fall in the upper-right quadrant of the graph above?
Below are just a few strategies that data scientists and business leaders can adopt to determine reliable ground truth and build trust in more complex ML models.
Cultivate AI literacy
If we want stakeholders to more intuitively understand why they need to be involved in example selection, they need to know what ground truth looks like. AI literacy is a tool to build this intuition.
AI literacy refers to the level of understanding and familiarity that individuals have with AI concepts, technologies, and their implications. It’s a critical component of understanding and trusting ML models, yet studies have shown that fewer than 25% of workers are data literate.
Cultivating data and AI literacy within your organization, whether through educational workshops (such as Cassie Kozyrkov’s Making Friends with Machine Learning series and her newly launched course, Decision Intelligence) or insightful articles, will significantly improve AI adoption rates and employee trust in AI-based initiatives.
Adopt a Risk Management Process That Includes Stress Testing
As models advance in complexity, adopting a risk management process that includes stress testing can help us catch the unexpected ways in which models can break.
Just like aerospace engineers test plane wings under extreme circumstances, AI builders must spend time designing the right stress tests or scenarios to understand where AI models might fail, then clearly communicate these potential risks to the stakeholders using these systems.
The AI Risk Management Framework from NIST is one great example of a risk assessment for organizations. It includes some grading of the complexity of the goal and underlying data so that teams can proactively understand the lengths to which they must go when determining ground truth.
Develop an Observability Practice
When we’re dealing with simple decisions and simple data, we can very quickly verify if the model performed well. For example, if we’re building a model that predicts whether or not a web customer is going to click a ‘buy’ button at the end of their session, within a matter of minutes, we get our answer. They either clicked it or didn’t, and we can verify what happened almost immediately.
However, as predictions increase in complexity, even slightly, verifying answers gets trickier. In one example, if we want to predict patient readmissions*, we have to wait 30 days before we get the verified answer of whether or not they were actually readmitted—which means we also have to wait 30 days before we can select our examples of readmitted patients for modeling.
Now, what happens if a patient moves out of state during this 30-day window and is seen somewhere we can’t observe? What are other consequences of longer timeframes, like 60 days or several months?
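One way to make this concrete: a label for 30-day readmission only exists once the full window has closed, and a patient we can no longer observe must be excluded rather than silently counted as "not readmitted." The sketch below illustrates that logic; the function name, fields, and dates are hypothetical, chosen only to show the three possible states.

```python
# Sketch: ground truth for 30-day readmission is only verifiable after the
# window closes, and unobservable patients are excluded, not labeled negative.
# The function, fields, and dates are illustrative.
from datetime import date, timedelta

WINDOW = timedelta(days=30)

def readmission_label(discharged, readmitted, lost_to_followup, today):
    """Return True/False once verifiable, or None if we cannot know yet."""
    if readmitted is not None and readmitted - discharged <= WINDOW:
        return True       # verified positive: readmitted within the window
    if lost_to_followup:
        return None       # e.g. patient moved out of state; outcome unobservable
    if today - discharged < WINDOW:
        return None       # window still open; wait before selecting this example
    return False          # full window observed with no readmission

today = date(2024, 3, 1)
print(readmission_label(date(2024, 1, 5), date(2024, 1, 20), False, today))  # True
print(readmission_label(date(2024, 2, 20), None, False, today))              # None (window open)
print(readmission_label(date(2024, 1, 1), None, True, today))                # None (unobservable)
print(readmission_label(date(2024, 1, 1), None, False, today))               # False
```

The important design choice is the `None` state: treating "we don't know yet" the same as "it didn't happen" quietly corrupts your ground truth, and the corruption only grows with longer observation windows.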
Once you have models running in production and producing predictions, an ML observability practice is essential for comparing model predictions against what actually happened, for two reasons:
- To continue building a larger dataset of verified examples (hello, ground truth, we see you).
- To measure how well models are actually performing.
*Patient readmissions are how likely a patient is to be readmitted to a hospital within 30 days after an inpatient visit.
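The two jobs above can be sketched as a single reconciliation step, assuming predictions and verified outcomes share a common identifier. All names here are illustrative; a real observability stack would add storage, scheduling, and richer metrics.

```python
# Sketch of the two observability jobs: harvest newly verified examples for
# the ground-truth dataset, and measure live accuracy. Names are illustrative.

def reconcile(predictions, outcomes):
    """Join predictions with verified outcomes as they arrive."""
    new_examples = []    # reason 1: grow the ground-truth dataset
    correct = total = 0  # reason 2: measure how the model actually performs
    for pid, predicted in predictions.items():
        if pid not in outcomes:
            continue     # outcome not yet observable; revisit on the next run
        actual = outcomes[pid]
        new_examples.append((pid, actual))
        correct += predicted == actual
        total += 1
    accuracy = correct / total if total else None
    return new_examples, accuracy

preds = {"s1": True, "s2": False, "s3": True}
actuals = {"s1": True, "s2": True}  # s3 is still inside its observation window
examples, acc = reconcile(preds, actuals)
print(examples, acc)  # [('s1', True), ('s2', True)] 0.5
```

Note that `s3` contributes to neither number until its outcome is observable, which is the same "don't guess, wait" discipline the readmission example calls for.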
It’s easy to get caught in the hype of designing more sophisticated ML models, but when it comes to building trust between stakeholders and AI, sometimes the easy solution is the better option. And if the problem does in fact call for a more complex model, be prepared to invest the time and resources into carefully defining your ground truth.
About the Author: Cal Al-Dhubaib is a globally recognized data scientist and AI strategist in trustworthy artificial intelligence, as well as the founder and CEO of Pandata, a Cleveland-based AI consultancy, design, and development firm.