In November 2022, Google announced its 1,000 Languages Initiative, which aims to develop an artificial intelligence (AI) model that can support the 1,000 most widely spoken languages. As part of this initiative, Google also introduced its Universal Speech Model (USM).
At the time, Google said the USM was trained on over 400 languages, “making it the largest language coverage seen in a speech model to date.” According to Google, the primary goal of the effort is to promote greater inclusion for billions of people in marginalized communities across the globe.
On March 6, 2023, Google provided additional details about the USM. In the accompanying research paper, Google describes the USM as “a family of state-of-the-art speech models” with 2 billion parameters, trained on 12 million hours of speech and 28 billion sentences of text covering over 300 languages.
Scaling Speech Technologies
The USM is used mainly in YouTube (e.g., for generating closed captions) and can perform automatic speech recognition (ASR) in over 100 languages. Coverage extends beyond widely spoken languages like English and Mandarin to under-resourced languages such as Amharic, Cebuano, Assamese, and Azerbaijani, among others.
For some of these languages, it is “very hard to find the necessary training data” because they “are spoken by fewer than twenty million people,” Google explained. This is “a fundamental challenge in scaling speech technologies to many languages,” the company added.
The USM uses a standard encoder-decoder architecture, and training proceeds in three steps. The first step is self-supervised learning on speech audio covering hundreds of languages. The second step further improves the model’s quality and language coverage through pre-training on text data; this step is optional, depending on the availability of text data, but Google said the USM performs better when it is included. The final step fine-tunes the model on specific tasks — such as ASR or automatic speech translation (AST) — using only a small amount of supervised data.
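The three-step recipe above can be sketched as a simple control flow. This is purely illustrative — the function and stage names below are hypothetical and model only the ordering of the stages (including the optional text pre-training step), not any real training code or Google API.

```python
# Illustrative sketch of the three-step USM training recipe.
# All names are hypothetical; this models control flow only, not real training.

def train_usm(unlabeled_audio, text_corpus=None, supervised_pairs=None):
    """Return the ordered list of training stages that would run."""
    stages = []
    # Step 1: self-supervised pre-training on unlabeled multilingual audio
    stages.append("self-supervised audio pre-training")
    # Step 2 (optional): continue pre-training with text data when available
    if text_corpus:
        stages.append("text pre-training")
    # Step 3: fine-tune on a small amount of supervised ASR/AST data
    stages.append("supervised fine-tuning (ASR/AST)")
    return stages

# With text data available, all three stages run; without it, step 2 is skipped.
print(train_usm(["clip.wav"], text_corpus=["sentence"], supervised_pairs=[("clip.wav", "text")]))
print(train_usm(["clip.wav"]))
```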
Google demonstrates that pre-training the model’s encoder on a large unlabeled multilingual dataset and then fine-tuning it on a smaller set of labeled data helps with recognizing under-represented languages. The company also claims the proposed training process is “effective at adapting to new languages and data.”
Accessibility and Inclusion
The USM seems important to the search giant. In the post, Google says that they “believe USM’s base model architecture and training pipeline comprise a foundation on which we can build to expand speech modeling to the next 1,000 languages.”
In addition, Jeff Dean, SVP of Google Research and machine learning legend, wrote in a tweet that “it will likely improve over other speech systems” as well.
The USM achieved state-of-the-art performance on multilingual ASR and AST across multiple datasets and domains. More specifically, Google compared the USM against publicly available pipelines, including Whisper, and found that the USM achieved a lower word error rate (WER).
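For readers unfamiliar with the metric: WER counts the word-level substitutions, deletions, and insertions needed to turn a system’s transcript into the reference transcript, divided by the number of reference words — lower is better. A minimal, self-contained implementation using word-level edit distance (not Google’s evaluation code) looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution or match
            )
    return dp[-1][-1] / len(ref)

# One word ("the") is dropped out of six reference words: WER = 1/6 ≈ 0.167
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```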
(Research paper authors: Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise Beaufays, Yonghui Wu)