What Can Go Wrong When Creating Data to Enable Multilingual AI

Editor’s note: Olga is a speaker for ODSC East 2022! Be sure to check out her talk, “Creating Data to Enable Multilingual AI: What Can Go Wrong and Ways to Mitigate It,” there!

Artificial intelligence (AI), and in particular conversational AI, one of its fastest-growing sub-domains, has a broad range of use cases in customer engagement, operations, and supply chain management across the globe.

The global reach of business has given rise to a need to collect and develop data for AI applications that both understand (natural language understanding) and generate (natural language generation) text and speech in multiple languages. Vast volumes of varied, high-quality multilingual datasets are essential for the models behind such applications to perform at their best. But we need to be very mindful of everything that can go wrong in the process of developing these massive multilingual datasets. Over the last few years, AI data experts and data scientists have come across a variety of issues that they could not have anticipated before embarking on the journey of developing global AI applications.


So, what can go wrong when creating data to enable multilingual AI? Problems that we have come across include:

  • Sociolinguistics: both language model and acoustic model training datasets can end up catering to only certain in-country demographics (gender, ethnicity, age group, education level), which will significantly limit customer engagement with your conversational AI product or voice search engine. When developing for local audiences, you can also introduce or carry over cultural phenomena from your English dataset that are not relevant to that geography. This is, in my opinion, different from the way data contributes to potentially harmful (unless intended) social biases in AI models, which I discuss below.
  • Engineering: multiple challenges arise when developing datasets for various locales. As with software internationalization, localization issues can surface if they are not addressed in advance, including code-switching, multiple glyph sets within a single language (think, for example, of the four writing systems used in Japanese), challenges specific to bi-directional languages, and ways of handling user errors such as repeated words, typos, and homophones.

    There are also conventions specific to the spoken modality (vs. the written modality) that will harm the language model's utility if they are not represented properly in the data: the user’s speech will simply not be understood (it is on the language model to understand the speaker, as the acoustic model only supports the phonetic mapping). Such conventions include alphanumeric values spelled out as words, filler words, special characters and punctuation, initialisms vs. acronyms (W.H.O. vs. NASA, with ASAP being an edge case), abbreviations, etc.

  • Bias and Inclusion: your models may produce gender, racial, age, and other social biases, driven both by algorithms and by the underlying data. This phenomenon has recently received a lot of attention because of the issues it has caused around diversity, equity, and inclusion across multiple industries.

Luckily, several techniques, both in managing your data (pruning, augmentation, assigning weights, and other state-of-the-art methods we’ll talk about) and in tweaking your algorithms, can help you control and reduce social bias in your models.
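To make "assigning weights" a little more concrete, here is a minimal sketch of one common option, inverse-frequency weighting, in which samples from under-represented groups count for more during training. It is an illustrative assumption rather than the specific method covered in the session; the function name `inverse_frequency_weights` and the age-group labels are made up for the example.

```python
from collections import Counter


def inverse_frequency_weights(groups):
    """Give each sample a weight inversely proportional to its group's size.

    The weights are scaled so that they sum to the number of samples,
    which means every group contributes the same total weight.
    """
    counts = Counter(groups)
    total = len(groups)
    n_groups = len(counts)
    return [total / (n_groups * counts[g]) for g in groups]


# Hypothetical speaker age groups attached to four utterances.
speaker_groups = ["18-30", "18-30", "18-30", "60+"]
weights = inverse_frequency_weights(speaker_groups)
print(weights)
```

Weights of this kind can usually be passed to a training routine as per-sample weights. They rebalance the data you already have, but they cannot stand in for demographics that are missing from the dataset altogether.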

The good news is that in my upcoming session at ODSC East, “Creating Data to Enable Multilingual AI: What Can Go Wrong and Ways to Mitigate It,” I will share ways to mitigate these problems, based on natural language processing (NLP) and other engineering solutions and on the data-creator training approaches that Welocalize has developed over the course of collaborating with its clients’ data scientists.

About the author/ODSC East 2022 Speaker: Olga Beregovaya, VP, AI Innovation at Welocalize, Inc.

A seasoned professional with over 20 years of leadership experience in language technology, NLP, ML and AI data generation and annotation, Olga is the VP, AI Innovation at Welocalize. She is passionate about growing business through driving change and innovation, and an expert in building things from scratch and bringing them to measurable success. Olga has experience on both the buyer and the supplier side, giving her a unique perspective around establishing strategic buyer/supplier alliances and designing cost-effective Global Content Lifecycle Programs. She has built and managed global production, engineering, and development teams of up to 300 members specializing in NLP and broader ML and AI.
