Speechless Recognition: Can AI Transcribe a Language It Has Never Heard?

Speech Recognition for Low Resource Languages

The short answer is yes, but with a word error rate (WER) of around 70%, one might wonder why anyone would bother.

It turns out there are a number of compelling reasons, from commercial to academic to humanitarian. We are now in the second year of what the UN has proclaimed the International Decade of Indigenous Languages in response to alarming predictions about the future of linguistic diversity. Of over 7,000 languages spoken around the world today, almost half are considered endangered, threatening the cultures and knowledge systems they are integral to.

While globalization and colonization have been driving language loss for centuries, there is concern that our increasingly digitized world, which caters to only a tiny handful of the world’s languages, is accelerating this process.

However, the very technology that concentrates human-machine collaboration in a few dominant languages can also be used in language preservation and revitalization efforts. Automatic speech recognition (ASR) is a valuable tool for language documentation, particularly in the absence of resources for human transcription, and can enhance language learning and translation tools.

Traditionally, ASR systems are trained on paired audio and transcription data in the target language. Although recent breakthroughs in multilingual speech recognition such as Meta’s XLS-R and Google’s Universal Speech Model improve performance for lower resource languages by pre-training on huge quantities of unlabeled data, they still fine-tune for ASR on labeled speech. OpenAI’s Whisper, which boasts human levels of accuracy for English transcription, pre-trains on multilingual paired data.

So what about languages with no labeled speech data? Or no speech data at all?

A Daunting Task

Researchers at Carnegie Mellon University (CMU) are investigating ways to expand ASR support from a few hundred languages to thousands. A key motivation for the research is language preservation, so their focus is on endangered languages for which audio data is scarce or unavailable. The ASR2K pipeline they presented at Interspeech 2022 holds promise, although an average WER of 70% hardly makes it an enticing alternative to human transcription yet.
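To put a 70% WER in perspective, the metric is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the system output and a reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that standard definition; the function name and example sentences are illustrative and are not taken from the ASR2K work.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") in six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.2f}")  # prints 0.33
```

At a WER of 70%, roughly seven of every ten reference words would need to be corrected, which is why such output is more useful as a starting point for documentation than as a finished transcript.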

To be fair, transcribing an unknown language is a daunting task for humans, too, even linguists specially trained in phonetic transcription. This method of representing pronunciation with a set of symbols corresponding to speech sounds, or phones, has several benefits for endangered languages and is key to ASR2K’s ability to decode an unheard language.

Phones are relatively language-independent, so it is feasible for a model to learn to recognize them from sufficiently diverse multilingual audio data. This is precisely what ASR2K attempts. Thanks to the work of field linguists over countless decades, phones can in turn be mapped to phonemes, a different type of speech unit that tends to correspond more closely to writing systems.
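The phone-to-phoneme step can be pictured as a lookup table built from a linguist's description of a language. The sketch below is a toy illustration using a few well-known English allophones (aspirated and unaspirated stops, and the American English flap); the table and function name are assumptions made for this example and do not reproduce ASR2K's actual mapping or its handling of unseen phones.

```python
# Toy allophone table for English: several phones collapse into one phoneme.
# Hand-written for illustration only, not ASR2K's data.
PHONE_TO_PHONEME = {
    "pʰ": "p", "p": "p",            # aspirated and plain [p] are both /p/
    "tʰ": "t", "t": "t", "ɾ": "t",  # the flap in "water" is an allophone of /t/
    "kʰ": "k", "k": "k",
}


def phones_to_phonemes(phones, mapping=PHONE_TO_PHONEME):
    """Map each recognized phone to its phoneme; unknown phones pass through."""
    return [mapping.get(phone, phone) for phone in phones]


# "pit" is pronounced with an aspirated [pʰ], "spit" with a plain [p];
# both contain the same phoneme /p/.
print(phones_to_phonemes(["pʰ", "ɪ", "t"]))
print(phones_to_phonemes(["s", "p", "ɪ", "t"]))
```

Passing unknown phones through unchanged is a simplification chosen here; a real system needs a principled fallback for phones that fall outside the target language's documented inventory.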

To convert phoneme representations to likely word sequences, ASR systems use language models (LMs) trained on text corpora, often with the aid of a pronunciation dictionary. For state-of-the-art ASR systems, the diversity and size of the LM play a decisive role in transcription accuracy. The CMU researchers also found that the more text data they could provide in the target language, the better ASR2K performed.
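To illustrate how a pronunciation dictionary and an LM work together, here is a deliberately small sketch that segments a phoneme sequence into the most probable word sequence by dynamic programming. It swaps in a unigram model where real systems use far richer LMs, and the lexicon, probabilities, and function name are all hypothetical; it is not ASR2K's decoder.

```python
import math

# Hypothetical pronunciation dictionary: word -> phoneme sequence.
LEXICON = {
    "a": ("ə",),
    "cat": ("k", "æ", "t"),
    "cats": ("k", "æ", "t", "s"),
    "sat": ("s", "æ", "t"),
    "at": ("æ", "t"),
}

# Hypothetical unigram LM: word -> probability (would be estimated from text).
UNIGRAM = {"a": 0.4, "cat": 0.25, "sat": 0.2, "at": 0.1, "cats": 0.05}


def decode(phonemes):
    """Return the highest-probability word sequence that spells out `phonemes`.

    Returns an empty list if the lexicon cannot cover the whole sequence.
    """
    n = len(phonemes)
    # best[i] = (log-probability, words) for the best parse of phonemes[:i].
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for word, pron in LEXICON.items():
            start = end - len(pron)
            if start < 0 or best[start][0] == -math.inf:
                continue
            if tuple(phonemes[start:end]) != pron:
                continue
            score = best[start][0] + math.log(UNIGRAM[word])
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[n][1]


# Two segmentations fit these phonemes: "a cat sat" and "a cats at".
# The LM prefers the first because its words are more probable.
print(decode(["ə", "k", "æ", "t", "s", "æ", "t"]))  # prints ['a', 'cat', 'sat']
```

The example shows why text data matters: the phonemes alone cannot choose between the two segmentations, and only the word probabilities learned from text settle it.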

However, text data is also scarce for endangered languages, which often lack a standardized orthography. Some are without keyboard, font, and/or Unicode support for their writing system, and many are purely oral. Although a phonetic transcription alone could have some use in language documentation, its accuracy is likely to be questionable without the assistance of an LM, and the lack of word boundaries would make it difficult to read and analyze.

Fortunately, research has revealed some encouraging alternatives for unwritten languages. Speech-to-meaning models can be trained to learn semantic representations for speech and map these to translated text or images. By neatly bypassing the need for a standardized writing system, this opens up a world of possibilities in speech technology for oral languages.

No Niche

If these sound like niche applications of ASR for academic and humanitarian purposes, think again.

Massively multilingual expansion has become a priority for major tech companies, from Amazon’s goal to scale virtual assistant technology to 1,000 languages to Google’s 1,000 Languages Initiative.

Meta’s No Language Left Behind project has already developed a speech-to-speech translation system for the primarily oral language of Hokkien using a translated text intermediary.

Despite not exactly being short on resources, these companies are eager to extend language coverage while paying as little as possible for time-consuming human transcription. With this comes the risk of AI colonialism, further marginalizing minority cultures and languages.

To defend against this, it is important to engage community groups in the development of technology for their language. Te Hiku Media, a Māori radio station that worked with its community to develop impressive ASR for te reo, stresses the importance of data sovereignty for indigenous languages in particular, as formalized in their Kaitiakitanga License.

If big tech is truly committed to working towards a more inclusive and responsible AI to preserve the richness of the world’s languages, this could be a good place to start.