The Top Challenges of Speech-to-Speech Generative AI

While interesting generative AI use cases—mostly in text-to-image generation from well-known models like Stable Diffusion and DALL-E—have emerged in recent years, the business value of the technology has remained largely untapped. And while image and video have their place in the enterprise, language is becoming a bright spot. 

Language, particularly in spoken form, is the great uniter, and breaking down barriers that prevent effective communication will have an enormous impact, not just on business, but society at large. And finally, we’re starting to see momentum in this area. 

But while generative speech-based machine learning has made massive strides recently, we still have a long way to go. In fact, voice compression—which occurs in apps we rely on heavily, like Zoom and Teams—is still based on tech from the eighties and nineties. 
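To make the age of that compression tech concrete: one classic telephony technique from that era is μ-law companding (standardized in ITU-T G.711), which squeezes speech into fewer bits by compressing loud samples more than quiet ones. The sketch below is illustrative only; it is not a claim about which codecs Zoom or Teams actually use.

```python
import math

MU = 255  # mu-law parameter from the G.711 telephony standard

def mu_law_encode(x: float) -> float:
    """Compress a sample in [-1, 1] with mu-law companding."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_decode(y: float) -> float:
    """Expand a companded sample back to [-1, 1]."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

sample = 0.5
restored = mu_law_decode(mu_law_encode(sample))
# restored is ~0.5 again; in practice the encoded value is quantized
# to 8 bits, which is where the bandwidth savings come from
```

The round trip is lossless in floating point; real codecs quantize the companded value, trading a small amount of quality for a large reduction in bitrate.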

While speech-to-speech technology has endless potential, it’s vitally important to assess the challenges and shortcomings creating barriers for generative AI to thrive. Here are three common pitfalls AI practitioners face when it comes to implementing speech-to-speech technologies.

1.) Sound Quality

The most important part of optimal communication is that it’s understandable. In the case of speech-to-speech technology, the goal is to provide optimal, human-like results. There are a few reasons why this is hard to achieve through AI, but the nuances in conversation play a big part. 

According to Mehrabian’s Rule, human conversation can be broken down into three parts: facial expression, tone of voice, and words. Machine understanding relies on text to operate, and only recent strides in natural language processing (NLP) have made it possible to train AI models on factors like sentiment, emotion, timbre, and other important, but not necessarily spoken, aspects of language. If you’re working with audio alone, without visuals, this becomes even more challenging, since the system loses the understanding that comes from a cue such as facial expression. 
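When facial cues are unavailable, audio-only systems fall back on acoustic properties of the signal itself. Modern models learn these from data, but a minimal sketch of two classic hand-crafted proxies (frame energy as a stand-in for loudness, zero-crossing rate as a crude correlate of pitch and noisiness) illustrates the idea. The toy "frames" below are synthetic, not real speech.

```python
from typing import List

def frame_energy(samples: List[float]) -> float:
    """Mean squared amplitude: a rough proxy for loudness/intensity."""
    return sum(s * s for s in samples) / len(samples)

def zero_crossing_rate(samples: List[float]) -> float:
    """Fraction of adjacent sample pairs that change sign; a crude
    correlate of pitch and noisiness in classic speech analysis."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)

# Toy frames: a loud, rapidly oscillating signal vs. a quiet steady one
excited = [0.8 if i % 2 == 0 else -0.8 for i in range(100)]
calm = [0.1] * 100

assert frame_energy(excited) > frame_energy(calm)
assert zero_crossing_rate(excited) > zero_crossing_rate(calm)
```

Real systems replace these hand-crafted features with learned embeddings, but the underlying challenge is the same: recovering emotion and emphasis from the waveform alone.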

2.) Latency 

Moving from analysis to AI synthesis takes time—but with speech-to-speech communication, real-time performance is typically what matters most. Voice conversion must happen as the words are being spoken, and translate accurately. Issues with latency have proven to be a big hurdle for many of the speech-to-speech solutions on the market today. 

However, the necessity of real-time conversion can vary by industry. For example, a content creator doing a podcast may be more concerned with sound quality than real-time voice conversion. But for industries like customer service, time is of the essence. If a call center agent is using voice assistive AI to respond to a customer, they may be able to sacrifice some quality, but time is crucial to providing a positive user experience. 
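Streaming systems typically quantify this with a real-time factor: audio arrives in short frames (20 ms is a common choice, assumed here for illustration), and each frame must be processed within its own duration or the conversion falls behind the speaker. A minimal sketch, with a placeholder standing in for the actual conversion model:

```python
import time

FRAME_MS = 20  # a common streaming frame size; assumption for illustration

def convert_frame(frame):
    """Placeholder for the real voice-conversion model inference."""
    return frame

def realtime_factor(frames, budget_ms=FRAME_MS):
    """Worst-case per-frame processing time relative to the frame budget.
    A value below 1.0 means the pipeline keeps up with live speech."""
    worst = 0.0
    for frame in frames:
        start = time.perf_counter()
        convert_frame(frame)
        elapsed_ms = (time.perf_counter() - start) * 1000
        worst = max(worst, elapsed_ms)
    return worst / budget_ms

frames = [[0.0] * 320 for _ in range(50)]  # 50 frames of 20 ms at 16 kHz
rtf = realtime_factor(frames)
```

A podcast tool can tolerate a real-time factor above 1.0 by processing offline; a call-center assistant cannot, which is why the latency budget drives model-size and architecture choices in that setting.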

3.) Scale 

In order for speech-to-speech technology to live up to its potential, it must support various accents, languages, and dialects, and be usable for everyone—not just specific geographies or markets. This takes mastery of specific applications of the technology and a great deal of tuning and training in order to effectively scale. 

Emerging tech solutions are not one-size-fits-all; organizations will need to support this AI infrastructure, potentially with thousands of model architectures behind a given solution. Users should also expect to test models consistently. While it sounds like a tall order, quality, time, and scale are all classic challenges of machine learning, many of which we’ve already overcome in other areas of AI. 
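One common way to "test models consistently" across accents and dialects is to track word error rate (WER) per locale, so regressions on any one population are caught early. Below is a minimal WER implementation (word-level edit distance); the locale test sets are hypothetical examples, not real benchmark data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # edit distances for empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)

# Hypothetical per-locale test sets: {locale: [(reference, model_output)]}
test_sets = {
    "en-US": [("turn the lights on", "turn the lights on")],
    "en-IN": [("turn the lights on", "turn the light on")],
}
scores = {
    locale: sum(word_error_rate(r, h) for r, h in pairs) / len(pairs)
    for locale, pairs in test_sets.items()
}
# scores["en-US"] == 0.0, scores["en-IN"] == 0.25
```

Tracking a score like this per accent, language, and dialect is one practical way to tell whether a system genuinely scales beyond a single market rather than just performing well on its majority population.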

Tackling the Challenges of Speech-to-Speech Technology

The first step to overcoming challenges with speech-to-speech technology is quite obvious: master the problem. It’s vital to know the desired outcome of your generative AI project before you start breaking it down into small, digestible solutions. Returning to the call center vs. content creator example, the application in which the technology is being used makes a difference.

Next, ensure your organization has the right data, architecture, and algorithms in place. Data quality matters, especially when you’re considering something as sensitive as human language and speech. Nailing down your goals, along with the resources and support your project will require, goes a long way when implementing speech-to-speech technology. 

Speech-to-speech technology has the potential to revolutionize the way we understand one another, opening opportunities for work, travel, fun, and more. But in order to get there, we must work to overcome the challenges in front of us. 

Article by Yishay Carmiel, CEO, Meaning

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.