Google Sends Translatotron 3 Into the Battle for Speech Translation


Google’s Translatotron is back and better than ever — according to its proud creators, who call it the “first fully unsupervised end-to-end model for direct speech-to-speech translation.” 

The Translatotron initially appeared on the scene in April 2019 as a very early proof-of-concept to improve on the traditional speech-to-speech translation (S2ST) model. 

Standard “cascade” speech translation systems chain three steps: automatic speech recognition (ASR) to transcribe the source speech into text, machine translation (MT) of that transcript, and text-to-speech (TTS) synthesis of the result. Translatotron bypassed the intermediate text translation step entirely.
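In rough pseudocode, the contrast between the two approaches looks something like this (a minimal sketch; every function here is a hypothetical stand-in, not a real API):

```python
# Minimal sketch contrasting cascade and direct S2ST.
# Every function is a hypothetical stand-in, not a real model.

def asr(audio: bytes) -> str:
    return "hola mundo"                # stand-in speech recognizer

def translate_text(text: str) -> str:
    return "hello world"               # stand-in machine translation model

def synthesize_speech(text: str) -> bytes:
    return text.encode()               # stand-in text-to-speech model

def cascade_s2st(source_audio: bytes) -> bytes:
    """Traditional cascade: transcribe, translate the text, then synthesize."""
    source_text = asr(source_audio)            # speech -> source-language text
    target_text = translate_text(source_text)  # text -> target-language text
    return synthesize_speech(target_text)      # text -> target-language speech

def direct_s2st(source_audio: bytes) -> bytes:
    """Translatotron-style direct model: a single network maps source speech
    to target speech, with no intermediate text representation."""
    ...  # one end-to-end encoder-decoder over speech spectrograms
```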

Translatotron 2, introduced in July 2021, outperformed the original, offering translation quality, speech robustness, and speech naturalness on par with traditional cascade systems. Researchers also included safeguards to prevent the model from being used to generate voice deepfakes.

The third version of Translatotron improves on its predecessors in several ways, most notably through its unsupervised S2ST architecture: the system “learns” S2ST from monolingual data alone.

“This method opens the door not only to translation between more language pairs but also towards translation of the non-textual speech attributes such as pauses, speaking rates, and speaker identity,” Google Research scientist Eliya Nachmani and software engineer Michelle Tadmor Ramanovich wrote in a December 1, 2023 blog post.

Fierce Competition

Speech translation is a hot topic in Silicon Valley. In November 2023, Google competitor Meta released its own AI model, Seamless, which reportedly translates speech in real time while preserving a consistent vocal style.

While Google prides itself on Translatotron’s omission of text translation, Meta promotes Seamless’ ability to handle ASR and speech-to-text translation for almost 100 input and output languages. Its speech-to-speech translation works from nearly 100 input languages into 36 target languages.

The Translatotron 3 authors proposed back-translation, a technique borrowed from unsupervised MT in which the model generates a synthetic translation of monolingual data and then learns to recover the original from it, as the key to eliminating the need for bilingual speech datasets in unsupervised S2ST.

Translatotron 3 underwent a two-part training process: the first phase trained the network to auto-encode its input, while the second trained it to translate that input via back-translation.
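Under some broad assumptions (a generic shared encoder with per-language decoder heads, mean-squared reconstruction error, and random tensors standing in for speech features), the two phases can be sketched in PyTorch roughly as follows; this is illustrative only, not the paper’s actual architecture:

```python
import torch
import torch.nn as nn

DIM, LANGS = 64, 2   # toy feature size; language 0 = source, 1 = target

class ToyS2ST(nn.Module):
    """Shared encoder with one decoder head per language (illustrative)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(DIM, 32), nn.ReLU())
        self.decoders = nn.ModuleList([nn.Linear(32, DIM) for _ in range(LANGS)])

    def forward(self, speech, lang):
        return self.decoders[lang](self.encoder(speech))

model, loss_fn = ToyS2ST(), nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Phase 1: auto-encoding. Learn to reconstruct monolingual "speech"
# in each language from a shared latent space.
for step in range(200):
    lang = step % LANGS
    speech = torch.randn(8, DIM)              # stand-in for audio features
    loss = loss_fn(model(speech, lang), speech)
    opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: back-translation. The frozen model produces a synthetic
# "translation" into the other language; the network then learns to map
# it back to the original, so no bilingual speech pairs are required.
for step in range(200):
    lang = step % LANGS
    speech = torch.randn(8, DIM)
    with torch.no_grad():
        synthetic = model(speech, 1 - lang)   # pseudo-translation
    loss = loss_fn(model(synthetic, lang), speech)
    opt.zero_grad(); loss.backward(); opt.step()
```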

Nachmani and Tadmor Ramanovich co-authored a June 2023 paper detailing Translatotron 3’s capabilities with Google Research scientists Alon Levkovitch and Chulayuth Asawaroengchai and Google DeepMind’s Yifan Ding and Heiga Zen.

The team compared Translatotron 3’s performance in Spanish-English translation (both directions) to a cascaded S2ST system that used ASR, unsupervised MT, and TTS. 

“Translatotron 3 outperforms the baseline by large margins in every aspect we measured: translation quality, speaker similarity, and speech quality. It particularly excelled on the conversational corpus,” the authors wrote. “Moreover, Translatotron 3 achieves speech naturalness similar to that of the ground truth audio samples.” 
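For context, translation quality for speech output is commonly measured with ASR-BLEU: the generated audio is transcribed by a speech recognizer, and the transcript is scored against reference text. A minimal sketch of that metric (the transcribe function below is a hypothetical stand-in; sacrebleu is a real package):

```python
import sacrebleu

def transcribe(audio: bytes) -> str:
    """Hypothetical stand-in for a real ASR system."""
    return "hello how are you"

def asr_bleu(translated_audio: list[bytes], reference_texts: list[str]) -> float:
    """Transcribe each generated utterance, then compute corpus-level
    BLEU against the reference translations."""
    hypotheses = [transcribe(audio) for audio in translated_audio]
    return sacrebleu.corpus_bleu(hypotheses, [reference_texts]).score
```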

Future work — which may entail the debut of Translatotron 4 — might explore more languages, zero-shot S2ST coupled with back-translation, and back-translation for different types of speech data, such as noisy speech and data from low-resource languages.