How Big of a Deal Is ‘Whisper’ for ASR and Multilingual Transcription?

On September 21, 2022, OpenAI released Whisper, an automatic speech recognition (ASR) system. Whisper approaches “human-level robustness and accuracy” on English speech recognition, according to OpenAI. The model also performs multilingual transcription and into-English translation.

Whisper is open source, so AI researchers and developers can install and use it immediately. It can also be accessed in the browser through the various demo interfaces that have sprung up, such as the one from Hugging Face.
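For developers, a minimal sketch of what "install and use" might look like with the open-source `openai-whisper` Python package is shown below. It assumes the package and `ffmpeg` are installed; the file name `interview.mp3`, the model size `"base"`, and the helper names `transcribe_file` and `format_segments` are illustrative choices, not something taken from OpenAI's announcement.

```python
def transcribe_file(path, model_name="base", task="transcribe"):
    """Run Whisper on one audio file and return its result dictionary.

    task="transcribe" keeps the spoken language;
    task="translate" asks Whisper for English output instead.
    """
    # Imported lazily because it is a heavy dependency
    # (assumed installed via: pip install openai-whisper).
    import whisper

    model = whisper.load_model(model_name)
    return model.transcribe(path, task=task)


def format_segments(result):
    """Turn the result's timestamped segments into readable lines."""
    lines = []
    for seg in result.get("segments", []):
        text = seg["text"].strip()
        lines.append(f"[{seg['start']:.1f}s-{seg['end']:.1f}s] {text}")
    return lines


if __name__ == "__main__":
    # Hypothetical file name; replace with a real recording.
    result = transcribe_file("interview.mp3")
    print(result["text"])
    for line in format_segments(result):
        print(line)
```

Passing `task="translate"` to the same helper would request into-English translation rather than same-language transcription.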

Mainstream media reaction to the release was fairly muted compared to the fanfare that greeted OpenAI’s text generator, GPT-3. But Whisper has caused ripples of speculation across the AI and ASR communities, who shared their hot takes on Twitter.

Is it a major breakthrough? How well does Whisper perform? What does the new ASR system mean for companies that have built their offering on speech recognition?

Whisper was trained in a novel way. Researchers did not use a carefully selected set of speech recordings with human transcriptions. Nor did they use a large volume of raw, untranscribed audio. Instead, researchers found a compromise between quality and quantity.

The training dataset was made up of 680,000 hours of English and multilingual audio-transcript pairs from the Internet. Transcription quality was patchy. Some effort was made to manually eliminate the most obvious issues but, overall, the “gold standard” of quality was relaxed in the interests of diversity and scale. 

The trade-off was beneficial. According to the OpenAI research paper, Whisper outperforms other fine-tuned ASR models when newly presented with broad and diverse data. In this setting, Whisper makes 55% fewer errors than other models, on average. Researchers concluded that Whisper’s performance in English speech recognition was “not perfect but very close to human-level accuracy.”
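A figure like "55% fewer errors" is most naturally read as a relative reduction in error rate rather than a drop of 55 percentage points. The short sketch below shows that arithmetic; the two error rates are made-up numbers chosen only to illustrate the calculation and are not results from the paper.

```python
def relative_error_reduction(baseline_error, new_error):
    """Fraction of the baseline's errors that the new system avoids."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates, for illustration only.
baseline_error = 0.20  # comparison model gets 20% of words wrong
whisper_error = 0.09   # Whisper gets 9% of words wrong

reduction = relative_error_reduction(baseline_error, whisper_error)
print(f"{reduction:.0%} fewer errors")
```

With these illustrative inputs, an 11-point gap in absolute error rate works out to a 55% relative reduction.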

Breakthrough Quality? Yes…and No

Whisper has been greeted as a breakthrough in speech recognition by some commentators. The CTO of CTERA, Aron Brand, highlighted the system’s accuracy and breadth of language coverage and said OpenAI was “changing the world.”

Whisper does appear to have an edge over other models for diverse datasets. Unlike some more finely tuned, “brittle” models that have trouble with variation, Whisper performs robustly on data that includes a wide variety of speakers, accents, background noise, and technical terms.

On the other hand, OpenAI’s researchers found that Whisper’s performance on a clean, curated benchmark dataset was unremarkable compared to other models.

Performance also varied widely by language. Whisper can transcribe speech in 99 languages, and translate into English from multiple languages. Researchers found that (as they had anticipated) better performance correlated with higher volumes of language data.

Testing, Testing…

Since Whisper’s release, various tweeters have reviewed Whisper’s capabilities. Some shared positive feedback on Whisper’s performance in Hindi as well as in minority languages such as Galician and Catalan. However, Whisper’s Bengali transcription was judged by one tweeter to be “way off the mark.”

DeepMind research engineer Aleksa Gordić demoed the system’s language capabilities by recording himself speaking in five languages and then using Whisper to automatically transcribe and translate the audio into English, noting “it only made a couple of mistakes.”

Andrej Karpathy, former Director of AI at Tesla, tweeted that he was “impressed” by Whisper. Karpathy used the system to transcribe a one-minute snippet of one of his own lectures and found that the transcription was “perfect” apart from one error.

However, software engineer Ryan Hileman observed some “peculiar fail cases.” After investing 3,000 GPU hours to test the system against other models, Hileman concluded that Whisper is “very good at producing coherent speech, even when it is completely incorrect about what was said.”

This finding was also noted by OpenAI’s researchers, who said that Whisper was capable of “hallucinations.” The system, they surmised, was mixing up its ability to predict the next word in the audio (based on its general language knowledge) with its transcription abilities.

Such flaws, in the view of machine learning scientist Leon Derczynski, come with the territory.

Slator compared the performance of Whisper and Descript on a YouTube video of a pub conversation about the World Cup, which featured colloquial language and a noisy background. Both systems struggled to produce an accurate transcription. However, while Descript tended to omit unrecognized words, Whisper generated more creative guesses, particularly for proper nouns. (For example, it proposed “I think it’s in a” for “Argentina,” and “the Irish on the roof” for “Thierry Henry.”)
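To put a number on guesses like these, ASR output is commonly scored with word error rate (WER): the word-level edit distance between the reference and the system output, divided by the number of reference words. The sketch below is a plain-Python illustration applied to the two examples above; it is not the scoring method Slator or OpenAI used, and the function name `word_error_rate` is an assumption made for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words (classic Levenshtein table,
    # kept one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


examples = [
    ("Argentina", "I think it's in a"),
    ("Thierry Henry", "the Irish on the roof"),
]
for reference, hypothesis in examples:
    print(reference, "->", hypothesis, word_error_rate(reference, hypothesis))
```

Because a fluent guess adds extra words, WER can exceed 100%: the one-word reference "Argentina" scores 5.0 and "Thierry Henry" scores 2.5. Omitting a word instead counts as a single deletion per missing word.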

New Paths & Possibilities

So, what’s significant about Whisper? The answer depends on your perspective. For some AI researchers, the most cogent point is that OpenAI has validated a new approach to building speech recognition models.

“[OpenAI’s] purpose is to show how scaling datasets in a weakly supervised fashion for supervised training is a better option,” tweeted Herumb Shandilya, a machine learning engineer. “Whisper certainly fulfilled what it aims to prove. This certainly opens paths for much more exploring on this. Hyped for the future!”

For others, the real game changer is the fact that Whisper is open source. High volumes of audio and video content can now be “unlocked.” Karpathy, for example, used Whisper to transcribe hundreds of podcast episodes. And fan-subtitlers of obscure movies may find themselves enjoying a new era of accessibility.

From OpenAI’s perspective, there is potential for Whisper to improve accessibility tools. “While Whisper cannot be used for real-time transcription out of the box,” researchers said, “others may be able to build apps on top that allow for near-real-time speech recognition and translation.”

For Dubverse founder Varshul CW, it was Whisper’s multilingual capabilities — along with its accessibility and scalability — that sparked enthusiasm. 

The video dubbing platform quickly released a Whisper-based transcription demo for users to try out.

And of course, once speech is converted to text, new possibilities flourish. Text can be analyzed, classified and transformed using natural language processing (NLP). Karpathy reflected on what such a future could look like in a tweet.

The Rising Tide of AI

With quality ASR now freely available, are companies that offer speech recognition now quaking in their boots?

Not at all, according to Scott Stephenson, CEO of ASR company Deepgram, who answered online speculation with a tweet: “OpenAI + Deepgram is all good — rising tide lifts all boats.”

Just like machine translation, speech recognition systems such as Whisper cannot be immediately deployed commercially in all contexts. Human intervention is often the solution for closing the “last mile,” the gap between what an AI model can produce and the quality needed by businesses. (In fact, OpenAI’s Whisper paper found that an expert-in-the-loop approach outperformed both human-only and model-only transcription.)

Demand is likely to continue to grow for ASR providers that can help companies build custom speech recognition models and inject AI into their workflows, as well as for language service providers (LSPs) that provide expert-in-the-loop transcription and captioning as managed services (such as AI-Media and Verbit).

The easy availability of ever-improving ASR and NLP systems will also provide further impetus for startups to build SaaS platforms for specific use cases. Such platforms continue to attract significant startup funding, with examples including meeting solution platforms Airgram and Otter.AI, video editing tools Descript and Dubverse, automated subtitling platform XL8, and AI-dubbing platform Papercup.

So far, the release has not visibly impacted the conventional transcription industry. While shares in Australia-listed AI-Media and Toronto-listed VIQ Solutions have shown weakness over the past few weeks, the downtrend was in line with the broader market and there was no particular drawdown in the days following the Whisper announcement.
