Among the most active players in the space is Google. The search giant introduced its Translatotron S2ST system in 2019 and a second version in July 2021. Researchers Eliya Nachmani, Alon Levkovitch, Yifan Ding, Chulayuth Asawaroengchai, Heiga Zen, and Michelle Tadmor Ramanovich at Google’s research laboratory DeepMind have now announced the third iteration of the direct S2ST model in a paper published on May 27, 2023.
Translatotron 3 is an enhanced version of its forerunner, Translatotron 2, which, according to the researchers, already offered superior translation quality, speech robustness, and speech naturalness.
Tackling the persistent challenge of limited speech datasets, the team claims to have achieved “the first fully unsupervised end-to-end model for direct speech-to-speech translation” with this third iteration of the model.
Unsupervised training means the model learns and makes inferences from unlabeled data, without predetermined answers. Rather than being trained conventionally on massive bilingual corpora, the model independently discovers consistent patterns and regularities in the data it is given.
Notably, the model relies on monolingual speech-text datasets in the training phase. The need for a bilingual dataset is compensated for by a technique known as “unsupervised cross-lingual embedding mapping.” Here, researchers train word embeddings independently in each language and then map them into a shared space through self-learning.
In other words, the model first learns the structure and nuances of each language separately, then uses what it has learned to find common ground between the intrinsic qualities and specificities of both. The resulting cross-lingual embeddings are used to initialize a shared encoder that handles and understands both languages equally.
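The core idea behind mapping two independently trained embedding spaces can be sketched in a few lines. The sketch below uses synthetic embeddings and a classic orthogonal Procrustes solution rather than the paper’s actual method; all data and dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, vocab = 8, 50

# "Source-language" embeddings, trained independently (synthetic stand-ins).
X = rng.normal(size=(vocab, dim))

# For the sketch, pretend the "target-language" space is the same space
# under an unknown rotation -- a strong simplifying assumption.
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
Y = X @ Q

# Orthogonal Procrustes: find the rotation W minimizing ||XW - Y||_F.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# After mapping, the two spaces should (near-)coincide.
alignment_error = np.linalg.norm(X @ W - Y)
```

In a real unsupervised setting no such clean rotation exists, so systems iterate between inducing a seed dictionary and refining the mapping; the linear-map intuition, however, is the same.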
The model further improves itself with the help of a masked autoencoder. In the encoding phase, the autoencoder is given only a portion of the data, and during decoding it must infer, or predict, the information that was hidden. This “guessing game” pushes the model to learn more meaningful representations.
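The masking objective described above can be illustrated in miniature. The function names and toy data below are hypothetical; the point is only that the loss is computed on the positions the model had to guess, not on what it could see.

```python
import numpy as np

rng = np.random.default_rng(1)

def mask_inputs(x, mask_ratio=0.5):
    """Hide a random fraction of positions; return the corrupted input
    and a boolean mask marking which positions were hidden."""
    mask = rng.random(x.shape) < mask_ratio
    corrupted = np.where(mask, 0.0, x)
    return corrupted, mask

def masked_reconstruction_loss(reconstruction, original, mask):
    """Score the model only on the positions it had to guess."""
    squared_error = (reconstruction - original) ** 2
    return squared_error[mask].mean()

x = rng.normal(size=(4, 16))   # a toy batch of feature vectors
corrupted, mask = mask_inputs(x)

# A perfect decoder recovers the original exactly, so its loss is zero;
# simply echoing the corrupted input does not.
perfect_loss = masked_reconstruction_loss(x, x, mask)
naive_loss = masked_reconstruction_loss(corrupted, x, mask)
```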
In addition, the model employs a back-translation technique to check its own work, much as a human translator would. Translating the output back toward the source and comparing it with the original helps ensure the translation is coherent and accurate.
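The round-trip idea behind back-translation can be shown with a deliberately tiny example. The word-for-word “translators” below are illustrative stand-ins, not part of any real system:

```python
# Toy bilingual tables (hypothetical, for illustration only).
EN_TO_ES = {"hello": "hola", "world": "mundo"}
ES_TO_EN = {v: k for k, v in EN_TO_ES.items()}

def translate(sentence, table):
    """Word-for-word lookup; unknown words pass through unchanged."""
    return " ".join(table.get(word, word) for word in sentence.split())

source = "hello world"
forward = translate(source, EN_TO_ES)       # forward translation
round_trip = translate(forward, ES_TO_EN)   # translate back to the source

# If the round trip fails to reproduce the source, the forward
# translation is suspect; a training signal can penalize the mismatch.
consistent = (round_trip == source)
```

In actual unsupervised training, back-translation generates pseudo-parallel pairs from monolingual data, turning the consistency check into a supervised-style loss.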
Traditionally, S2ST has been tackled with a cascaded approach that chains automatic speech recognition, machine translation, and text-to-speech synthesis. In contrast, Translatotron 3 relies on a novel end-to-end architecture, directly mapping source-language speech to the target language without relying on an intermediate textual representation.
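The architectural contrast can be stated as simple function composition. The three stage functions below are placeholders standing in for real ASR, MT, and TTS components, not any actual API:

```python
# Placeholder stages (hypothetical; tag their input to show data flow).
def asr(audio):
    return f"text({audio})"

def mt(text):
    return f"translated({text})"

def tts(text):
    return f"audio({text})"

def cascaded_s2st(audio):
    """Cascade: each stage's errors propagate to the next, and
    paralinguistic cues (voice, intonation) are lost at the text
    bottleneck."""
    return tts(mt(asr(audio)))

def direct_s2st(audio):
    """End-to-end: a single model maps source speech straight to
    target speech with no intermediate text (placeholder)."""
    return f"target_audio({audio})"

cascaded_out = cascaded_s2st("src_audio")
direct_out = direct_s2st("src_audio")
```

The nested string in the cascaded output makes the pipeline’s text bottleneck visible, which is precisely what the direct architecture avoids.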
In this context, Translatotron 3 outperforms its cascaded counterpart by 18.14 BLEU points.
Besides improved accuracy, the end-to-end approach proves effective at preserving para- and non-linguistic information. Because it links source speech directly to the target language, it can transfer characteristics inherent to the input speech, including the original speaker’s identity and the naturalness of the voice.
The researchers claim that Translatotron 3 also captures other traces of non-verbal information, such as pauses, speaking rate, and intonation. This capability has the potential to set new standards in the field, since S2ST would then convey both meaning and speaker nuance. And the unsupervised training approach may shape how similar S2ST models are trained in the future.