What’s the Main Priority for Data Labeling in Modern ML, Quality or Scale: Experts Weigh In

AI relies on labeled data to train algorithms: without the data, there can’t be any machine learning. Data labeling accuracy is often difficult to achieve on its own, but it becomes even more of an issue when scalability is at stake. Many believe that either quality or quantity can be prioritized, but not both at the same time: there’s a tradeoff.

For example, if a model needs labeled images of human facial expressions for training, it can get either 1,000 flawlessly labeled images or 10,000 images of acceptable but not immaculate quality. Which is better? A unique mix of experts from the data labeling industry, among them researchers, practitioners, and crowd performers from Toloka, met at the VLDB 2021 workshop to discuss this question from a number of different perspectives.

Opinions differ across various fields 

According to Grace Abuhamad, Applied Research Scientist in Trustworthy AI at ServiceNow, the quality-quantity tradeoff is one of the most pressing dilemmas today. From the vantage point of business and end-user satisfaction, product quality is of the utmost importance, and managing product quality, which relies mainly on data quality, remains a priority. On the other hand, some specialists, particularly those who favor the model-centric (as opposed to the data-centric) approach, purposely opt for training algorithms that require either less accurate data or less data in general. This is one of the ways to circumvent the issue of scalability.

Mohamed Amgad, a Predoctoral Pathology Fellow from Northwestern University, argues that this is very much a domain-dependent scenario. Such tricks won’t work in certain fields, medicine being one of them. More specifically, in computational pathology, the data must be perfect, and there has to be a lot of it. When AI has to detect a potential medical risk by analyzing microscopic tissue and offer a prognosis for the patient, a mistake can mean a one-way trip to the graveyard. In this situation, the lesser of two evils simply cannot be chosen, because neither large quantities of poorly labeled data nor small quantities of adequately labeled data will suffice. In many cases, we’re talking about hundreds of thousands if not millions of images that need to be fed into the training model, each one of them validated by at least three people, all of whom must be not only skilled labelers, but highly qualified professionals in the medical field.


The jury is still out on the tradeoff of data labeling approaches

Another problem that Mohamed talks about is the sample problem. Even if the data is accurately labeled and there’s enough of it, where did it come from? Turns out that quite often, AI specialists attempt to produce a lot of medical data from a small number of unique patients. While it may be quality data and there may be plenty of it at first glance, it’s not inherently scalable data, because it represents only a limited number of cases. So, while the problem of quality remains at the center of medical AI development, what is really missing is sufficient quantities of the labeled data to account for variability.

How do we come to terms with this interconnectedness and interdependence of quality and quantity? Attempts have been made to look at it from the standpoint of the machine itself, including initiatives from HCOMP and Facebook, for example cats4ml.humancomputation.com. Ivan Stelmakh, a PhD candidate at Carnegie Mellon University, argues that before we decide which track to take, we should understand the tradeoff better from a zero-sum-game perspective in an experimental setting. Ivan says that large language models, for instance, can learn from anything, but it’s very difficult to determine and guarantee the reliability of every piece of information. The models themselves should guide us: we should try sacrificing quality in favor of quantity and vice versa, and see how each choice affects the performance of these language models.
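The kind of experiment Ivan describes can be sketched on a toy problem. The example below is only an illustration under stated assumptions, not the setup discussed at the workshop: it uses synthetic one-dimensional data and a nearest-centroid classifier instead of a language model, and every name in it (`make_data`, `flip_labels`, `fit_centroids`, `run`) is hypothetical. It trains once on a small, cleanly labeled set and once on a larger set with a share of labels flipped, then scores both on the same clean test set.

```python
import random

def make_data(n, seed):
    """Toy 1-D data: class 0 is centered at -1, class 1 at +1 (assumed setup)."""
    rng = random.Random(seed)
    ys = [i % 2 for i in range(n)]
    xs = [rng.gauss(2 * y - 1, 1.0) for y in ys]
    return xs, ys

def flip_labels(ys, noise_rate, seed):
    """Simulate lower labeling quality by flipping each binary label
    independently with probability noise_rate."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < noise_rate else y for y in ys]

def fit_centroids(xs, ys):
    """Mean of x per class; assumes both classes appear in ys."""
    means = []
    for c in (0, 1):
        pts = [x for x, y in zip(xs, ys) if y == c]
        means.append(sum(pts) / len(pts))
    return means

def accuracy(means, xs, ys):
    """Share of points whose nearest centroid matches the true label."""
    preds = [0 if abs(x - means[0]) <= abs(x - means[1]) else 1 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

def run(n_train, noise_rate, seed=0):
    """Train on possibly noisy labels, evaluate on a clean test set."""
    xs, ys = make_data(n_train, seed)
    noisy = flip_labels(ys, noise_rate, seed + 1)
    test_xs, test_ys = make_data(2000, seed + 2)
    return accuracy(fit_centroids(xs, noisy), test_xs, test_ys)

if __name__ == "__main__":
    print("small, clean:", run(100, 0.0))
    print("large, noisy:", run(1000, 0.2))
```

Varying `n_train` and `noise_rate` over a grid shows where extra volume compensates for label noise on this toy task. Whether the same pattern holds for a real model and real labeling errors is exactly what such an experiment would have to establish.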

What is worth sacrificing (first)?

Mohamed agrees with this proposition, though he reminds us that hard science fields don’t offer us such a leisurely choice: we need to have both these factors in place for the model to be of use. Having said that, in the medical field, volume (that is, having many data points) is more important for training the model, whereas quality (that is, the accuracy and verifiability of these data points) is more important for testing and validating the model’s overall performance. In short, we should prepare and study experimental data sets and algorithms before we implement them in the real world.

Zack Lipton, Assistant Professor at Carnegie Mellon University, explains that depending on the quality of the data, it is possible to predict what level of scalability can be feasibly accomplished. Some quality issues can be overcome at scale – some cannot. In some fields, unlike the medical sphere, the question of what’s better—a small amount of high-quality data vs. a large amount of low-quality data—may be appropriate. Zack goes on to say that in some situations (provided we do a majority vote or run another sophisticated algorithm), it’s actually better to receive a bigger data set with each item being labeled only a single time as opposed to a smaller set with multiple labels/labelers per item.
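The choice Zack describes, between labeling more items once and labeling fewer items several times, can be put into rough numbers. The sketch below is a simplified illustration, not a method presented by the panel: it assumes a binary task, labelers who err independently, and the same per-label accuracy `p` for everyone. The function names (`majority_vote`, `majority_vote_accuracy`, `compare_budgets`) are hypothetical.

```python
from collections import Counter
from math import comb

def majority_vote(labels):
    """Return the most common label among the labels given to one item."""
    return Counter(labels).most_common(1)[0][0]

def majority_vote_accuracy(p, k):
    """Probability that a majority of k labelers is right on a binary task,
    assuming each is independently correct with probability p (k must be odd)."""
    if k % 2 == 0:
        raise ValueError("k must be odd so that a majority always exists")
    need = k // 2 + 1
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(need, k + 1))

def compare_budgets(budget, p, k):
    """For a fixed number of paid labels, return (items, label accuracy)
    for single labeling and for k-fold labeling with majority vote."""
    single = (budget, p)
    multi = (budget // k, majority_vote_accuracy(p, k))
    return single, multi

if __name__ == "__main__":
    print(majority_vote(["happy", "sad", "happy"]))
    print(compare_budgets(3000, 0.8, 3))
```

Under these assumptions, a budget of 3,000 labels buys either 3,000 items that are each right 80% of the time, or 1,000 items that are each right about 89.6% of the time. Which of the two trains a better model depends on the task and on how tolerant the learning algorithm is of label noise, which is the open question here.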

Jie Yang, Assistant Professor at Delft University of Technology, insists that the issue of quality is, from critical perspectives like safety and fairness, a bigger concern than quantity. For this reason, Jie opposes a heavy focus on scalability at the expense of quality. He explains that we have already witnessed several unfortunate instances when an attempt to produce more data backfired, resulting in technical issues, inferior products, and unavoidably skyrocketing costs. For Jie, we should (re)address the question of quality first and foremost, especially today, when AI is widely deployed in domains that are sensitive in terms of safety, trust, and ethics.

Ujwal Gadiraju, another Assistant Professor at the same university, also sees quality as being of prime importance. Ujwal argues that while it’s hard to find an optimal balance between quality and scalability, the research community tends to lean towards the quality side of the equation. This is particularly true of the model-driven approach, in which models are trained on predetermined data: volume alone will not fix the problem there. Plus, scalability isn’t as big an issue as it used to be, because we’ve learned to deal with the lack of training data through data-centric approaches like crowdsourcing, says Ujwal.

Crowdsourcing and the role of performers

Konstantin Kashkarov, a crowd worker with Toloka, believes that crowdsourcing platforms allow for scalability without reducing data labeling quality, provided all protocols are followed thoroughly. Novi Listyaningrum, a graduate student at Institut Kesenian Jakarta and part-time Toloker, believes that the quantity problem has been completely solved at Toloka, and she thinks that adequate quality can be maintained, too. What this means is that by (a) making sustainable use of global crowd power and (b) taking a socially responsible stance and offering opportunities to individuals, the platform has created a win-win situation for itself and the labelers.

Olga Megorskaya, CEO of Toloka, argues that labelers will do their part well if they are told exactly what to do. This starts with clearly written instructions that contain plentiful examples. If this prerequisite is met, crowdsourcing in and of itself can tackle the rest. As a result, those with a longer history of ML production and a better understanding of the delivery pipelines hold an advantage and can achieve better results than those who can’t make their goals completely transparent to the labelers.

Key takeaways

  • There exists a quality vs. quantity tradeoff in data labeling.
  • Some fields, like medicine, cannot sacrifice one in favor of the other.
  • Some fields can accept larger but lower-quality data sets over more accurately labeled sets that fail to provide enough data points.
  • Those in the research community tend to tilt towards quality as being the more important of the two aspects.
  • The quality vs. quantity tradeoff overlaps with the debate over model-centric vs. data-centric approaches to AI development.
  • Data-centric methods like crowdsourcing have been able to solve the issue of scalability.
  • The key to solving the issue of quality in crowdsourcing largely lies in the hands of the requesters, who should always provide clear instructions and offer both financial and non-financial incentives for more accurately and quickly labeled data sets.

About the Author:

Daria Baidakova, Director of Educational Programs at Toloka, is responsible for consulting and educating Toloka’s requesters on integrating crowdsourcing methodology in AI projects. She is a co-author of hands-on tutorials on efficient crowdsourcing at WSDM’20, CVPR’20, SIGMOD’20, WWW’21, and NAACL’21 as well as the co-organizer of the crowd science workshop at NeurIPS’2020.
