Research into speech-to-speech translation (S2ST) — that is, technology that converts a speaker’s or actor’s words from one language to another — is booming.
In the fourth quarter of 2022 alone, preprint research repository arXiv featured 27 papers related to S2ST, with a number of household names on the roster.
Driving the boom in research is the wide variety of applications for S2ST, from live translation of video calls all the way to machine dubbing (or AI dubbing) and Mark Zuckerberg’s vision of a boundless metaverse.
It’s no coincidence, then, that Meta AI contributed six papers, including a November 2022 writeup introducing SpeechMatrix, which the authors described as the “largest freely available speech-to-speech translation corpus.”
Another November 2022 paper from Meta AI focused on building a system to support S2ST for languages without standard writing systems, using Taiwanese Hokkien as a case study.
With three papers released during the quarter, Microsoft was the second most prolific source of S2ST research. The company’s October 2022 paper is notable for proposing a model jointly pretrained with unpaired speech and bilingual text data to improve direct S2ST.
Some of Microsoft’s biggest competitors are already quite familiar with the “direct” method of S2ST, which bypasses the traditional intermediate steps of automatic speech recognition (ASR) and machine translation.
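The distinction can be sketched in a few lines of Python. The function names below are hypothetical placeholders standing in for entire models, not calls to any real library; the stubs return dummy values only so the sketch runs end to end.

```python
# Hypothetical placeholder models; each stub returns a dummy value
# so the sketch is runnable end to end.
def asr(audio: bytes) -> str: return "source-language transcript"
def mt(text: str) -> str: return "target-language text"
def tts(text: str) -> bytes: return b"synthesized target speech"
def s2st_model(audio: bytes) -> bytes: return b"translated target speech"

def cascaded_s2st(source_audio: bytes) -> bytes:
    """Traditional cascade: ASR -> MT -> TTS, with text in the middle."""
    transcript = asr(source_audio)   # speech -> source-language text
    translation = mt(transcript)     # source text -> target text
    return tts(translation)          # target text -> target speech

def direct_s2st(source_audio: bytes) -> bytes:
    """Direct S2ST: one model maps source speech to target speech,
    with no intermediate text representation."""
    return s2st_model(source_audio)

print(cascaded_s2st(b"input audio"))
print(direct_s2st(b"input audio"))
```

Part of the appeal of the direct route is that errors no longer compound across three separate systems, and paralinguistic cues such as the speaker’s prosody need not be discarded at the text stage.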
Big Tech Meets Academia
Meta AI introduced BLASER, a text-free metric for evaluating S2ST quality without relying on ASR, in a December 2022 paper. (This followed the company’s June 2022 publicity around a new multilingual, textless S2ST methodology, which reportedly resulted in the first S2ST framework “trained on real-world open sourced audio data.”)
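To make the idea of a text-free metric concrete, here is a toy sketch in that spirit: score a system’s output by comparing speech embeddings directly, with no transcripts involved. The embedding setup and the way the similarities are combined are illustrative assumptions, not BLASER’s actual formulation.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def textless_score(src_emb: np.ndarray,
                   hyp_emb: np.ndarray,
                   ref_emb: np.ndarray) -> float:
    """Toy text-free quality score: how close the system's output
    speech embedding sits to both the source speech and a reference
    translation, entirely in embedding space (no ASR step).
    Illustrative only; the published metric differs."""
    return 0.5 * (cosine(src_emb, hyp_emb) + cosine(ref_emb, hyp_emb))

# Stand-in embeddings; a real system would produce these with a
# multilingual speech encoder.
rng = np.random.default_rng(0)
src, hyp, ref = (rng.normal(size=512) for _ in range(3))
print(round(textless_score(src, hyp, ref), 3))
```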
Google, meanwhile, debuted its S2ST system, Translatotron, in 2019. In July 2021, Google claimed that the second iteration, Translatotron 2, outperformed the original in translation quality, speech robustness, and speech naturalness.
Two papers from Australia’s Monash University in Melbourne explored solutions to technical issues in S2ST: the high computational cost of using pre-trained speech Transformers to achieve state-of-the-art (SOTA) results, and the scarcity of large-scale data for most language pairs and domains.
Similarly, a December 2022 paper by ByteDance AI Lab went a step beyond simply enlarging datasets, proposing “Mix at Three Levels for Speech Translation” (M3ST), a method for increasing the diversity of an augmented corpus. The company’s second paper explored a method for working around the lack of textual annotation in Mandarin-Cantonese S2ST.
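As a rough illustration of the kind of mixing such augmentation schemes rely on (and not the paper’s specific three-level recipe), mixup-style interpolation blends pairs of training utterances at the frame level:

```python
import numpy as np

def frame_level_mix(feats_a: np.ndarray, feats_b: np.ndarray,
                    alpha: float = 0.2, seed: int = 0) -> np.ndarray:
    """Mixup-style interpolation of two utterances' acoustic features
    (e.g., log-mel frames), truncated to a common length. A generic
    illustration only; M3ST's actual recipe differs."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)          # mixing weight ~ Beta(alpha, alpha)
    n = min(len(feats_a), len(feats_b))   # align on the shorter utterance
    return lam * feats_a[:n] + (1.0 - lam) * feats_b[:n]

# Two fake utterances with shape (frames, 80 mel bins).
a = np.random.default_rng(1).normal(size=(120, 80))
b = np.random.default_rng(2).normal(size=(100, 80))
print(frame_level_mix(a, b).shape)  # (100, 80)
```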
Other notable contributors to the Q4 2022 flurry of S2ST research, each with at least one paper, included Alibaba, Tencent, Google Research, and Carnegie Mellon University.