S2ST completely bypasses the step of transcribing audio and translating the resulting text, making for a sleeker process. One major player investing in S2ST is Google, whose Translatotron 2, a June 2021 revamp, reportedly outperforms its 2019 precursor.
Still, STT has its place and its proponents, even among companies exploring S2ST. In September 2021, video conferencing platform Zoom unveiled plans to offer live, multilingual transcription and translation for calls.
Now, self-described “AI community” Hugging Face, a software company that relies heavily on volunteers to experiment with datasets and machine learning models, has announced the release of state-of-the-art speech translation models created in partnership with Facebook AI. The models build on a previous collaboration on another base model, Wav2Vec 2.0, which was introduced in December 2020 and has been downloaded more than 29,000 times since early September 2021.
Hugging Face’s initial tweet linked to a model (downloaded 801 times since September 2021) that allows users to convert English speech into German text. Facebook AI, however, stated in its announcement that the speech-to-text models can also translate English into Arabic, Catalan, and Turkish, and linked to all four models.
Critics did not take long to point out the models’ shortcomings and voice their skepticism. One conceded that the models are “exciting stuff, but still a long way before the quality of the Arabic translations are sufficient. For example, the second en-ar sample case provided is incorrect.”
Another tweet, rife with sarcasm, asked Facebook, “Oh so you’ve made improvements since this?” followed by a link to news coverage of an infamous 2017 incident in which Facebook’s faulty machine translation led to an innocent man’s arrest.
Facebook AI, of course, remained upbeat, calling the speech-to-text models simply “another step toward eliminating language barriers.”