Researchers at Google have released the latest iteration of Translatotron, two years after the speech-to-speech translation (S2ST) system’s April 2019 debut. The new version outperforms its precursor on several levels, according to software engineers Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, and Roi Pomerantz.
In a July 2021 paper, “Translatotron 2: Robust direct speech-to-speech translation,” the Google team said that, in their experiments, Translatotron 2 outperformed the previous version on translation quality, speech robustness, and speech naturalness, reaching a level comparable to that of a traditional cascade system.
Quick recap: A typical S2ST cascade consists of several sub-systems, most commonly automatic speech recognition (ASR) to transcribe the source speech, followed by text machine translation (MT), and then text-to-speech (TTS) synthesis in the target language. Google’s Translatotron project instead translates speech directly into speech, skipping the intermediate text steps.
Although work on so-called “direct” S2ST, which bypasses ASR and MT, is limited, there are a number of benefits, including greater ease in working with languages that lack a written form and in handling content that does not require translation, such as proper nouns.
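To make the contrast concrete, the cascade can be pictured as three functions chained together, with text passed between them. The sketch below is illustrative only: the `make_cascade` helper and the toy stand-in stages are hypothetical names invented here, not part of any Google system, and real stages would be trained models.

```python
from typing import Callable, List

# Hypothetical representation for illustration: audio as a list of samples.
Audio = List[float]


def make_cascade(
    asr: Callable[[Audio], str],
    mt: Callable[[str], str],
    tts: Callable[[str], Audio],
) -> Callable[[Audio], Audio]:
    """Chain ASR -> MT -> TTS into a single speech-to-speech function."""

    def translate(source_audio: Audio) -> Audio:
        source_text = asr(source_audio)  # speech -> source-language text
        target_text = mt(source_text)    # source text -> target-language text
        return tts(target_text)          # target text -> speech

    return translate


# Toy stand-ins so the sketch runs end to end.
def toy_asr(audio: Audio) -> str:
    return "hola" if audio else ""


def toy_mt(text: str) -> str:
    return {"hola": "hello"}.get(text, text)


def toy_tts(text: str) -> Audio:
    return [float(ord(ch)) for ch in text]


cascade = make_cascade(toy_asr, toy_mt, toy_tts)
output_audio = cascade([0.1, 0.2])
```

A direct system would replace all three stages with a single model that maps source audio to target audio, which is why it needs no written form of either language.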
Translatotron 2 comprises three parts, connected by an attention module: a source speech encoder; a target phoneme decoder; and a target mel-spectrogram synthesizer. The model is jointly trained with a speech-to-speech translation objective and a speech-to-phoneme translation objective.
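Joint training of this kind is commonly implemented by summing one loss per objective. The following is a minimal sketch under stated assumptions, not the paper’s actual formulation: the choice of mean squared error for the spectrogram, cross-entropy for the phonemes, and the `phoneme_weight` parameter are all assumptions made here for illustration.

```python
import math
from typing import List, Sequence


def spectrogram_loss(
    pred: Sequence[Sequence[float]], target: Sequence[Sequence[float]]
) -> float:
    """Mean squared error over all mel-spectrogram bins (assumed loss)."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def phoneme_loss(
    probs: Sequence[Sequence[float]], target_ids: Sequence[int]
) -> float:
    """Mean cross-entropy of the target phoneme ids (assumed loss)."""
    log_likelihood = sum(math.log(step[i]) for step, i in zip(probs, target_ids))
    return -log_likelihood / len(target_ids)


def joint_loss(
    pred_mel: Sequence[Sequence[float]],
    target_mel: Sequence[Sequence[float]],
    phoneme_probs: Sequence[Sequence[float]],
    target_phonemes: Sequence[int],
    phoneme_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Sum of the speech-to-speech and speech-to-phoneme objectives."""
    return spectrogram_loss(pred_mel, target_mel) + phoneme_weight * phoneme_loss(
        phoneme_probs, target_phonemes
    )


# Tiny example: two spectrogram frames and two phoneme steps.
pred_mel: List[List[float]] = [[0.0, 1.0], [1.0, 1.0]]
target_mel: List[List[float]] = [[0.0, 0.0], [1.0, 1.0]]
phoneme_probs: List[List[float]] = [[0.5, 0.5], [0.25, 0.75]]
target_phonemes: List[int] = [0, 1]

loss = joint_loss(pred_mel, target_mel, phoneme_probs, target_phonemes)
```

Because both terms feed one scalar, a single backward pass would update the shared encoder with signal from both the phoneme decoder and the spectrogram synthesizer.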
The original Translatotron could generate translated speech in a different voice using either a clip of the target speaker’s audio (as reference audio for the speaker encoder) or the embedding of the target speaker. While this capability is potentially useful in industries such as film and gaming, it also made Translatotron “ripe for potential misuse.”
Translatotron 2 takes a different approach to prevent its use in deepfakes. The trained model can only retain the source speaker’s voice; it cannot generate speech in a different speaker’s voice.
Another related improvement is the ability to retain each speaker’s original voice across “speaker turns,” where the speaker changes within a single input, which the authors noted would be challenging for cascade systems. Starting from a TTS model that preserves voices through translation, the researchers augmented the training data so that Translatotron 2 could learn from examples with speaker turns.
The researchers added that these kinds of modifications can increase “the diversity of the speech content as well as the complexity of the acoustic conditions in the training examples, which can further improve the translation quality of the model, especially on small datasets.”
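One plausible way to build training examples with speaker turns is to join pairs of existing examples from different speakers end to end. The sketch below assumes that concatenation-style scheme. The `Example` fields, the `concat_pair` and `augment` helpers, and the sampling strategy are hypothetical illustrations rather than the researchers’ code.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One training example (field names are assumptions for illustration)."""
    source_audio: List[float]
    target_audio: List[float]
    target_phonemes: List[str]


def concat_pair(first: Example, second: Example) -> Example:
    """Join two examples so the result contains one speaker turn."""
    return Example(
        source_audio=first.source_audio + second.source_audio,
        target_audio=first.target_audio + second.target_audio,
        target_phonemes=first.target_phonemes + second.target_phonemes,
    )


def augment(examples: List[Example], num_new: int, seed: int = 0) -> List[Example]:
    """Return the originals plus num_new concatenated pairs."""
    rng = random.Random(seed)
    new_examples = []
    for _ in range(num_new):
        first, second = rng.sample(examples, 2)  # two distinct examples
        new_examples.append(concat_pair(first, second))
    return examples + new_examples


examples = [
    Example([0.1, 0.2], [0.3], ["h", "i"]),
    Example([0.4], [0.5, 0.6], ["b", "ai"]),
    Example([0.7, 0.8, 0.9], [1.0], ["o", "k"]),
]
augmented = augment(examples, num_new=2, seed=1)
merged = concat_pair(examples[0], examples[1])
```

Each concatenated example is longer and mixes two voices and two sets of acoustic conditions, which is consistent with the added diversity and complexity the researchers describe.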