It has been just two months since Microsoft researchers demoed VALL-E, a text-to-speech (TTS) model that can convincingly mimic your voice based on a 3-second recording. Now, with VALL-E X, they have extended it with a multilingual dataset and translation modules that can re-render a speaker's voice in another language based on a single utterance.
The VALL-E models draw inspiration from the success of large language models in text generation. Instead of training on small, carefully curated datasets of studio-recorded speech, they learn from huge volumes of semi-supervised data. This diverse, multilingual, and multi-speaker speech data is derived from open-source corpora, some pre-labeled and the rest automatically transcribed.
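The paper does not spell out the tooling behind that automatic transcription, but the pseudo-labeling step itself is a familiar pattern. Here is a minimal sketch using OpenAI's open-source Whisper model as a stand-in ASR system; the model choice, directory layout, and output format are illustrative assumptions, not Microsoft's pipeline:

```python
# Pseudo-labeling sketch: transcribe unlabeled audio so it can join
# the training corpus alongside the pre-labeled data.
# Assumes: pip install openai-whisper, and a directory of .wav files.
from pathlib import Path

import whisper  # open-source ASR model, used here as a stand-in

model = whisper.load_model("medium")  # larger checkpoints transcribe more accurately

for audio_path in Path("unlabeled_corpus").glob("*.wav"):
    result = model.transcribe(str(audio_path))
    transcript = result["text"].strip()
    # Store the (audio, transcript) pair for TTS training.
    audio_path.with_suffix(".txt").write_text(transcript, encoding="utf-8")
```

Run over thousands of hours of found audio, a loop like this yields (audio, transcript) training pairs far more cheaply than studio recording and careful curation.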
As with the original VALL-E, one of the more impressive aspects of VALL-E X is its ability to preserve not only the distinctive features of the speaker's voice but also their emotion and acoustic environment. If the sample recording was made in an echoey chamber and the speaker sounds angry, those characteristics will carry over into the generated audio. Thanks to the size and diversity of the training data, the model learns to replicate these features effectively in synthesis.
As an added bonus, the researchers found they could adjust the foreign accent of the synthesized voice to make it sound more native, alleviating a known issue in cross-lingual TTS. The model even handled code-switching fluently despite a lack of such examples in the training data. This mixing of multiple languages within an utterance, characteristic of many multilingual communities, is a tricky area for traditional TTS.
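The paper credits the accent control to a language ID supplied to the model alongside its other inputs: choose the ID of the target language rather than the speaker's, and the output drifts toward a native accent. The sketch below illustrates that kind of conditioning; the token names and exact sequence layout are assumptions for illustration, not the published implementation:

```python
# Conditioning sketch: how a language ID token can steer accent.
# Token names and sequence layout here are illustrative assumptions,
# not the published VALL-E X implementation.

LANG_IDS = {"en": "<lang:en>", "zh": "<lang:zh>"}

def build_prompt(src_phonemes, tgt_phonemes, src_acoustic_tokens, tgt_lang):
    """Assemble the conditioning sequence for the acoustic language model.

    Choosing the target language's ID token, rather than the speaker's
    native language, nudges the output toward a native accent.
    """
    return (
        [LANG_IDS[tgt_lang]]      # language ID steers accent/pronunciation
        + src_phonemes            # phonemes of the source-language prompt
        + tgt_phonemes            # phonemes of the text to synthesize
        + src_acoustic_tokens     # codec tokens carrying the speaker's voice
    )

# Example: a Chinese speaker's prompt, rendered as native-sounding English.
sequence = build_prompt(
    src_phonemes=["n", "i", "h", "ao"],
    tgt_phonemes=["HH", "AH", "L", "OW"],
    src_acoustic_tokens=["<a:412>", "<a:87>", "<a:903>"],
    tgt_lang="en",
)
print(sequence)
```

Because the acoustic tokens carry the speaker's timbre while the language ID carries pronunciation habits, swapping the target language shifts the accent without swapping the voice.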
Readers may judge the quality of the generated speech by listening to the demo, which shows the model's performance on a range of speech generation tasks: cloning a voice from an input prompt while converting text to speech in a different language, speech-to-speech translation, foreign accent control, emotion maintenance, and synthesis of code-switched utterances.
Unexpected Arrival
For now, there is just a single language pair, Chinese↔English, but the next step is to expand to other languages. That should be straightforward for high-resource languages, since one of the major benefits of Microsoft's approach is that it does not require hard-to-source paired bilingual recordings from the same speakers for training. For low-resource languages, leveraging knowledge from high-resource languages through transfer learning and data augmentation, or through language-agnostic meta-learning, could yield promising results.
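As a rough illustration of the transfer-learning route, the sketch below freezes a pretrained model and fine-tunes only its phoneme embeddings on new-language data, a standard recipe when labeled audio is scarce. The tiny model and synthetic batch are placeholders standing in for a large pretrained checkpoint and real low-resource data, not anything from the paper:

```python
# Generic transfer-learning sketch for a low-resource language.
import torch
from torch import nn

class TinyCodecLM(nn.Module):
    """Stand-in for a pretrained phoneme-to-acoustic-token model."""

    def __init__(self, n_phonemes=100, n_acoustic=1024, dim=256):
        super().__init__()
        self.phoneme_embedding = nn.Embedding(n_phonemes, dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dim, n_acoustic)

    def forward(self, phonemes):
        return self.head(self.backbone(self.phoneme_embedding(phonemes)))

model = TinyCodecLM()  # imagine this loaded from a high-resource checkpoint

# Freeze the knowledge transferred from high-resource languages...
for param in model.parameters():
    param.requires_grad = False
# ...and adapt only the layers tied to the new language's symbol set.
for param in model.phoneme_embedding.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
loss_fn = nn.CrossEntropyLoss()

# One synthetic low-resource batch: 8 utterances of 32 phonemes each.
phonemes = torch.randint(0, 100, (8, 32))
acoustic_targets = torch.randint(0, 1024, (8, 32))

logits = model(phonemes)  # (8, 32, 1024)
loss = loss_fn(logits.reshape(-1, 1024), acoustic_targets.reshape(-1))
loss.backward()   # gradients flow only to the unfrozen embeddings
optimizer.step()
```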
The research paper itself reveals very little about Microsoft's intentions for VALL-E X, and for now, the code has not been released. There are many potential uses, starting with the obvious speech-to-speech translation tasks that big tech has been working on for years: multilingual communication and machine dubbing, but with the bonus of preserving the original speaker's voice.
Along with the likes of Meta, Google, and Amazon, a whole raft of startups has sprung up specializing in this space. ElevenLabs is one intriguing example. The company received a lot of negative press earlier this year when it released its voice-cloning tool to beta testers, but there is no denying the technology is impressive. Indeed, it is precisely because it works so well that it was so effectively abused to create deepfakes of celebrities spewing hate speech, spoof voice ID to break into a bank account, and spawn a new meme of US presidents trash-talking while gaming.
Although the unexpected arrival of VALL-E X may have somewhat stolen its thunder, ElevenLabs plans to release its “automatic dubbing tools that let you speak the language you don’t” later this year. These days, it seems that the next big AI breakthrough is never far off.