In a blog post, Meta introduced Voicebox, which the company calls “a breakthrough in generative AI for speech.” According to the team at Meta, this state-of-the-art AI model can perform speech-generation tasks such as editing, sampling, and stylizing.
For example, Voicebox can edit out noise such as a dog barking or car horns, without losing either the content of the audio or its style. The model is also multilingual: it can produce speech in six languages, namely English, French, German, Spanish, Polish, and Portuguese. Given a sample of someone’s speech and a passage of text in one of these languages, it can produce a reading of the text in any of them.
Interestingly enough, this also works when the text and the audio sample are in different languages. Of course, one of the main benefits of a model this robust is the ability to produce natural-sounding voices for programs such as virtual assistants and, as the team mentions, non-playable characters within the Metaverse itself.
But that’s not all. Voicebox’s technology could also help people who are visually impaired hear written messages from friends, read by the AI in those friends’ voices. Imagine your text messaging service is connected to Voicebox: a text message from a friend or family member would sound just like a voice message from them.
According to the team at Meta, Voicebox requires only about two seconds of audio to match an audio style for use in text-to-speech generation. That’s one second less than Microsoft’s model. There’s more. As part of its noise reduction feature, Voicebox can also recreate a portion of speech that was interrupted by noise, or even replace misspoken words, without the need to re-record an entire message.
Finally, because of the data set it learned from, Voicebox can generate speech that is “more representative of how people talk in the real world,” according to Meta. In short, the lines between man and machine can be blurred even further when it comes to voice generation.
Check out this YouTube Short to see Voicebox in action for yourself: