Meta AI has a new generative AI model for speech called Voicebox. The model was introduced on June 16, 2023, less than a month after the company announced its Massively Multilingual Speech (MMS) project — a model said to have speech-to-text and text-to-speech abilities in 1,100 languages as well as language identification skills in 4,000 spoken languages.
The company is touting Voicebox as “the first model that can generalize to speech-generation tasks it was not specifically trained to accomplish.” That is no typo: the blog does say “generalize to speech-generation tasks.” In machine learning, a model generalizes when it performs well on data or tasks it did not see during training.
The team at Meta AI stated that it considers the need for training a main limitation of other speech synthesizers. But as with any model, training was still needed to create Voicebox in the first place. In this case, over 50,000 hours of recorded speech and transcripts from public domain audiobooks in English, French, Spanish, German, Polish, and Portuguese were used.
To generalize across speech tasks, Voicebox learns from raw audio or from audio paired with text. When the input is text alone, different voices can be generated in any of the six languages. When both text and a reference audio clip are used, the output is audio of the text spoken with the characteristics of the reference voice.
🔉 Introducing Voicebox: a #generativeAI model that can help with audio editing, sampling and styling.— Meta Newsroom (@MetaNewsroom) June 16, 2023
In the future, this technology could help creators easily edit audio tracks for videos, allow visually impaired people to hear written messages from friends in their voices,… pic.twitter.com/2h8GAytb28
Speaker Says Qué?
Besides voice generation, the model can produce multilingual audio output: the input can be in one language and the output generated in another.
The multilingual feature is not unique to Voicebox. Columnist Rowan Cheung, one of the first people to react to Meta AI’s news about the Voicebox model on Twitter, says the company is “about to go compete” with Play.ht and ElevenLabs (one of Slator’s 50 Under 50 AI companies), both of which announced multilingual capabilities in April 2023.
Meta AI researchers say Voicebox can also perform other voice and audio tasks, including inserting audio segments in the middle of a recording without having to create the entire input again, transferring the style and vocal characteristics of a speaker across languages, removing background noise, and correcting content.
Meta AI is on fire.— Rowan Cheung (@rowancheung) June 18, 2023
They just announced Voicebox, a multilingual high-quality text-to-speech AI.
The quality is so good that they're not making the Voicebox model or code publicly available (yet) to avoid misuse.
Sounds like it's about to go compete with ElevenLabs/PlayHT. pic.twitter.com/Ws733Aqtlo
Examples of these features can be heard in the video clip shared in the announcement, but the company says the technology itself is not for sharing. Meta AI researchers explained in the blog that they are not making the Voicebox model or code publicly available out of concern for the potential risks of misuse.