For some of the biggest names in AI, advances in language technology can look like moves in an arms race. And on May 22, 2023, Meta threw some shade at its competitor OpenAI, maker of Whisper, a multilingual automatic speech recognition and transcription system.
To introduce Meta’s Massively Multilingual Speech (MMS) project, VP and Chief AI Scientist Yann LeCun tweeted to praise the model’s speech-to-text and text-to-speech abilities (1,100 languages) and language identification skills (4,000 spoken languages). The final touch: “half the word error rate of Whisper.”
Responses on Twitter were mixed. Fans expressed awe at the sheer number of languages covered, as well as the pace of development.
“4000 languages and I am speechless…” gushed one tweet. Another, from @BoredGeekz, observed: “Oh wow this is huge! This model covers dialects for which it was impossible to build a strong dataset! But somehow with Meta’s huge conversational base, it became possible!”
A third tweet suggested that “speaker diarization […] would really set it apart from whisper and so many useful applications.”
Critics took issue with Meta’s comparison to Whisper. “The model looks good; the evaluation looks terrible,” NLP researcher Benjamin Marie tweeted. “Many of the WER reported/copied from previous work are not comparable.”
Self-described AI architect Daniel Monge said he would like to see “an apples-to-apples comparison with Whisper when it comes to more widely used languages, like English, Spanish, Portuguese, etc.,” arguing that LeCun’s statement did not seem to account for them.
“I mean… Of course it’s going to perform better at a language less than 0.1% of the world speaks since the end-goal was to address the long tail of languages in the first place,” Monge explained.
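The critics’ objection turns on how word error rate is measured: differences in text normalization and evaluation data can make WER figures copied from different papers incomparable. The metric itself is simply word-level edit distance divided by the length of the reference transcript. A minimal sketch in Python (function name and example strings are illustrative, not from either system):

```python
# Word error rate (WER): Levenshtein distance computed over words,
# divided by the number of words in the reference transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution (or match)
            )
    return dp[-1][-1] / len(ref)

# One dropped word out of a six-word reference -> WER of 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Note that whether transcripts are lowercased, stripped of punctuation, or otherwise normalized before this calculation can swing the resulting number substantially, which is exactly why evaluations drawn from different papers are hard to compare.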
Indeed, Meta’s own PR for MMS emphasized that the system can provide “speech-to-text, text-to-speech, and more for 1,100+ languages,” and touted the long list of supported languages as “a 10x increase from previous work.”
The research paper introducing MMS seemed to hint at Meta’s plans to expand into speech translation for even more languages.
“Even though we built speech systems supporting between 1,100-4,000 languages, there are currently over 7,000 languages being spoken around the world today,” the authors wrote. “Moreover, there are many dialects which are often inadequately represented in the training data, even for high-resource languages such as English.”
Whisper, meanwhile, is quickly becoming embedded in production. Captioning leader AI Media made Whisper available through its cloud service; HappyScribe rebuilt its transcription solution on a fine-tuned Whisper model; and in early May 2023, integration SaaS Zapier began offering a Whisper connector.