What is Automatic Speech Recognition?

Automatic Speech Recognition (ASR) is a technology that processes human speech and converts it into text. It first appeared as early as 1952 with Bell Labs’ Audrey system, which could only recognize spoken digits; today, ASR is part of everyday life in voice search, virtual assistants, live captioning, and clinical note-taking.

There are two main approaches to ASR. In the traditional hybrid approach, systems comprise separately trained acoustic, pronunciation, and language model components. In the end-to-end approach, a single system learns the acoustic, pronunciation, and language information contained in speech and directly maps a sequence of input acoustic features to a sequence of words.
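To make the end-to-end contrast concrete, here is a minimal sketch in which one pretrained model maps audio straight to text, with no separately trained acoustic, pronunciation, or language models. It assumes the Hugging Face transformers library, an example model, and a placeholder audio file; none of these is prescribed by the systems discussed in this article.

```python
# Minimal end-to-end ASR sketch (assumes: pip install transformers torch, plus ffmpeg).
# The pipeline wraps a single model that maps raw audio directly to text.
from transformers import pipeline

# facebook/wav2vec2-base-960h is one example of an end-to-end model trained on LibriSpeech;
# "speech_sample.wav" is a placeholder path to a short mono recording.
asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
print(asr("speech_sample.wav")["text"])
```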

The pace of ASR development has been rapid in recent years. In January 2021, Facebook AI released a new large-scale open-source dataset, Multilingual LibriSpeech (MLS), expanding on LibriSpeech with seven languages in addition to English. Facebook AI hoped this would promote “open and collaborative research in multilingual ASR and improve speech recognition systems in more languages”.

Subsequently, OpenAI released its open-source ASR system, Whisper, in September 2022. The system is capable of speech recognition, multilingual transcription across 99 languages, and translation into English from several of those languages.
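Both capabilities are available through the open-source package OpenAI released with the model. The sketch below is a minimal example, assuming the openai-whisper package is installed and that “interview_fr.mp3” is a placeholder French-language recording.

```python
# Minimal Whisper sketch (assumes: pip install openai-whisper, plus ffmpeg).
import whisper

model = whisper.load_model("base")  # "base" is one of several available model sizes

# Multilingual transcription: the spoken language is detected automatically.
result = model.transcribe("interview_fr.mp3")
print(result["language"], result["text"])

# Speech translation: transcribe the same audio directly into English.
translated = model.transcribe("interview_fr.mp3", task="translate")
print(translated["text"])
```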

Whisper’s training prioritized diversity and scale over perfectly curated data, and OpenAI claimed the system is approaching “human-level robustness and accuracy” for English speech recognition. According to OpenAI’s research paper, Whisper made an average of 55% fewer errors than other models when confronted with a variety of speakers, accents, technical terms, and background noise. André Bastié, CEO of Happy Scribe, described Whisper as “a big change of paradigm in the AI sphere. First, it is showing that having a lot of data makes a difference. Second, it is the multilingual aspect that is impressive.”

Only a month later, Google announced its 1,000 Languages Initiative to develop an AI model supporting the world’s 1,000 most widely spoken languages. This included the development of the Universal Speech Model (USM), which can perform ASR in over 100 languages, including under-resourced ones, and achieves a lower word error rate than other publicly available pipelines, such as Whisper.

Current Challenges

According to Giuseppe Daniele Falavigna, Senior Researcher at Fondazione Bruno Kessler (FBK), and Marco Turchi, Head of the Machine Translation group at FBK, ASR has undergone a revolution since the emergence of deep neural networks. However, issues remain with homophones, code-switching, background noise, and variability in speaking speed, volume, and tone.

One of the biggest challenges for ASR and speech translation (ST) is handling unknown personal names in unfamiliar languages, since models trained on English audio try to make every sound match English words. The accuracy of transcribing and translating personal names sits at around 40%.

One solution is to incorporate audio from another language into the training data, which improved results by an average of 48%. Another is to include referent names in the training data: the more often a name appeared, the higher the likelihood that the system could transcribe it correctly.
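A related, lighter-weight tactic, separate from the training-data solutions above, is to hint expected names to the model at inference time. Whisper’s transcribe() function, for instance, accepts an initial_prompt string that biases decoding toward the terms it contains. The sketch below assumes the openai-whisper package and a placeholder recording; the file name and speaker names are purely illustrative.

```python
# Biasing Whisper toward known personal names at inference time
# (assumes: pip install openai-whisper; "panel_discussion.mp3" is a placeholder file).
import whisper

model = whisper.load_model("base")
result = model.transcribe(
    "panel_discussion.mp3",
    # Text passed as initial_prompt is given to the decoder as context,
    # raising the chance that these names are transcribed correctly.
    initial_prompt="Speakers: Giuseppe Daniele Falavigna, Marco Turchi.",
)
print(result["text"])
```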

André Bastié explained two problems on SlatorPod. The first is, “how do you build a model that is aware of the current news?” Happy Scribe is trying to provide “an AI that is aware of the latest jargon”, because when a major world event such as COVID occurs, current models cannot cope and users have to proofread the output carefully since the terms and concepts are new and unknown. The second challenge is bias. As Bastié explains, “these models are still quite biased towards men and towards not understanding minority accents”.

ASR Applications in Translation, Post-Editing, and Subtitling

ASR technologies could be a useful tool for enhancing performance across several services offered by the language industry. Translators, subtitlers, post-editors, and project managers who specialize in workflows combining ASR and MT could find themselves in high demand. ASR could also make the translation industry more accessible to blind and visually impaired individuals.

ASR has been found to raise text input speed from around 40 words per minute (typing) to as much as 150 (dictation), which could accelerate multiple parts of the workflow, including translation and post-editing. Using ASR for post-editing has also proved more ergonomic: voice input adds another dimension to the task, making it more engaging, and translators can combine or switch between input modes depending on task difficulty and the changing conditions of human-computer interaction.

ASR is particularly useful for translators in remedying the literalness issues incurred by using computer-assisted translation (CAT) tools and MT. It has also proved useful for web searches, drafting emails, and alleviating ailments such as repetitive strain injury, eye strain, and back pain. ASR has already been integrated into some commercially available CAT tools: memoQ works with Apple’s speech recognition service, for example, and Matecat with Google Voice. The caveat is that professionals must have translation experience before incorporating ASR, and ASR-generated translations must be checked closely.

New developments in ASR and MT have improved the efficiency of the subtitling process, which for two decades involved two separate stages: transcription and translation. ASR and MT engines trained on subtitling data also produce better-quality subtitles.

Proof of the commercial impact of ASR in multilingual subtitling comes from TransPerfect, a Super Agency and number one in Slator’s 2023 Language Service Provider Index, which has deployed AppTek’s ASR technology to speed up its subtitling workflows.