On September 21, 2022, OpenAI released Whisper, an automatic speech recognition (ASR) system. Whisper approaches “human-level robustness and accuracy” on English speech recognition, according to OpenAI. The model also performs multilingual transcription and translation into English.
Released as open source, Whisper can be installed and used immediately by AI researchers and developers, or accessed in the browser through the various demo interfaces that have sprung up, such as the one from Hugging Face.
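For a sense of how lightweight that "installed and used immediately" claim is: the open-source release ships as a pip-installable package (`openai-whisper`) with both a Python API and a `whisper` command-line tool. The sketch below only assembles a CLI invocation; the flags shown (`--model`, `--task`, `--language`) follow the project's README, and the actual run is left commented out because it downloads model weights.

```python
import subprocess

def whisper_cmd(audio_path, model="small", task="transcribe", language=None):
    """Assemble an argument list for the open-source `whisper` CLI."""
    cmd = ["whisper", audio_path, "--model", model, "--task", task]
    if language is not None:
        cmd += ["--language", language]
    return cmd

if __name__ == "__main__":
    cmd = whisper_cmd("lecture.mp3", model="base", language="en")
    print(" ".join(cmd))
    # To actually transcribe (requires `pip install -U openai-whisper`
    # and a model-weight download on first run):
    # subprocess.run(cmd, check=True)
```

Setting `task="translate"` instead would ask the model for into-English translation rather than same-language transcription.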
Mainstream media reaction to the release was fairly muted compared to the fanfare that greeted OpenAI’s text generator, GPT-3. But Whisper has caused ripples of speculation across the AI and ASR communities, who shared their hot takes on Twitter.
Is it a major breakthrough? How well does Whisper perform? What does the new ASR system mean for companies that have built their offering on speech recognition?
Whisper was trained in a novel way. Researchers did not use a carefully selected set of speech recordings with human transcriptions. Nor did they use a large volume of raw, untranscribed audio. Instead, researchers found a compromise between quality and quantity.
The training dataset was made up of 680,000 hours of English and multilingual audio-transcript pairs from the Internet. Transcription quality was patchy. Some effort was made to manually eliminate the most obvious issues but, overall, the “gold standard” of quality was relaxed in the interests of diversity and scale.
The trade-off paid off. According to OpenAI’s research paper, Whisper outperforms other fine-tuned ASR models when newly presented with broad and diverse data. In this setting, Whisper makes 55% fewer errors than other models, on average. The researchers concluded that Whisper’s performance in English speech recognition was “not perfect but very close to human-level accuracy.”
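Figures like these are typically reported as word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the system's transcript into the reference, divided by the reference length, so "55% fewer errors" is a relative reduction in WER. A minimal sketch of the metric (a standard edit-distance computation, not OpenAI's own evaluation code):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[-1][-1] / len(ref)

# one substitution ("the" -> "a") across six reference words
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

By this measure, a model scoring 4.5% WER where a baseline scores 10% would be making 55% fewer errors.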
Breakthrough Quality? Yes…and No
Whisper has been greeted as a breakthrough in speech recognition by some commentators. The CTO of CTERA, Aron Brand, highlighted the system’s accuracy and breadth of language coverage and said OpenAI was “changing the world.”
Whisper does appear to have an edge over other models on diverse datasets. Unlike some finely tuned, “brittle” models that struggle with variation, Whisper performs robustly on data that includes a wide variety of speakers, accents, background noise, and technical terms.
On the other hand, OpenAI’s researchers found that Whisper’s performance on clean, curated benchmark datasets was unremarkable compared to other models.
Performance also varied widely by language. Whisper can transcribe speech in 99 languages and translate into English from multiple languages. The researchers found that, as they had anticipated, performance in a given language correlated with the volume of training data for that language.
Since Whisper’s release, various tweeters have reviewed Whisper’s capabilities. Some shared positive feedback on Whisper’s performance in Hindi as well as in minority languages such as Galician and Catalan. However, Whisper’s Bengali transcription was judged by one tweeter to be “way off the mark.”
DeepMind Research Engineer Aleksa Gordić demoed the system’s language capabilities by recording himself speaking in five languages and then using Whisper to automatically transcribe and translate the audio into English, noting “it only made a couple of mistakes.”
Andrej Karpathy, former Director of AI at Tesla, tweeted that he was “impressed” by Whisper. Karpathy used the system to transcribe a one-minute snippet of one of his own lectures and found that the transcription was “perfect” apart from one error.
However, software engineer Ryan Hileman observed some “peculiar fail cases.” After investing 3,000 GPU hours to test the system against other models, Hileman concluded that Whisper is “very good at producing coherent speech, even when it is completely incorrect about what was said.”
This finding was also noted by OpenAI’s researchers, who said that Whisper was capable of “hallucinations.” The system, they surmised, was mixing up its ability to predict the next word in the audio, drawn from its general language knowledge, with its transcription of the audio itself.
Such flaws, in the view of machine learning scientist Leon Derczynski, come with the territory.
Sounds a bit like Whisper has managed to go for fluency at the cost of faithfulness, which I think given the state of neural ASR prior to its release was a laudable and reasonable tradeoff— Leon Derczynski 🌲🏔️ (@LeonDerczynski) September 28, 2022
Slator compared Whisper and Descript’s performance on a YouTube video of a pub conversation about the World Cup, which featured colloquial language and a noisy background. Both systems struggled to produce an accurate transcription. However, while Descript tended to omit unrecognized words, Whisper generated more creative guesses, particularly for proper nouns. (For example, proposing “I think it’s in a” for “Argentina,” and “the Irish on the roof” for “Thierry Henry.”)
New Paths & Possibilities
So, what’s significant about Whisper? The answer depends on your perspective. For some AI researchers, the most cogent point is that OpenAI has validated a new approach to building speech recognition models.
“[OpenAI’s] purpose is to show how scaling datasets in a weakly supervised fashion for supervised training is a better option,” tweeted Herumb Shandilya, a machine learning engineer. “Whisper certainly fulfilled what it aims to prove. This certainly opens paths for much more exploring on this. Hyped for the future!”
For others, the real game changer is the fact that Whisper is open source. High volumes of audio and video content can now be “unlocked.” Karpathy, for example, used Whisper to transcribe hundreds of podcast episodes. And fan-subtitlers of obscure movies may find themselves enjoying a new era of accessibility.
Just tried OpenAI whisper on a very old early German sound film (1930), captured from a not-so-good quality source. And while it didn’t catch all of the dialogue in the 30sec clip I tried, the lines it did capture were almost perfect. Gamechanger for fansubs of obscure movies! pic.twitter.com/EXu4mQnjXo— Johannes Baiter 👶 💻 (@jbaiter_) September 24, 2022
From OpenAI’s perspective, there is potential for Whisper to improve accessibility tools. “While Whisper cannot be used for real-time transcription out of the box,” researchers said, “others may be able to build apps on top that allow for near-real-time speech recognition and translation.”
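The reason Whisper is not real-time out of the box is that it processes audio in 30-second windows, so a near-real-time wrapper typically buffers the incoming stream and feeds the model fixed-size, slightly overlapping chunks. A minimal sketch of that chunking logic (the window and overlap parameters are illustrative choices, not values from OpenAI; a real app would pass each chunk to a Whisper call):

```python
def chunk_stream(samples, sample_rate=16_000, window_s=30, overlap_s=2):
    """Yield overlapping fixed-length windows from a growing audio buffer.

    Whisper consumes 30-second windows; a small overlap between
    consecutive chunks helps avoid cutting words at chunk boundaries.
    The final chunk may be shorter (Whisper pads short input itself).
    """
    window = window_s * sample_rate
    step = (window_s - overlap_s) * sample_rate
    start = 0
    while start < len(samples):
        yield samples[start:start + window]
        start += step

# a real app would run something like:
#   for chunk in chunk_stream(mic_buffer):
#       text = model.transcribe(chunk)  # hypothetical Whisper call
```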
For Dubverse founder Varshul CW, it was Whisper’s multilingual capabilities — along with its accessibility and scalability — that sparked enthusiasm.
THIS IS HUGE— Varshul CW (@varshul_cw) September 22, 2022
Going multi-lingual becomes very accessible and scalable with @OpenAI whisper’s translate use-case
With this, any content can be converted from base language to English
99 base languages across Indian, Russian, Roman, African, Asian, and more are covered 💥 https://t.co/V813hrej2R
The video dubbing platform quickly released a Whisper-based transcription demo for users to try out.
And of course, once speech is converted to text, new possibilities flourish. Text can be analyzed, classified and transformed using natural language processing (NLP). Karpathy reflected on what such a future could look like in a tweet.
As someone who very much enjoys podcasts I continue to be frustrated that so much information is locked up in opaque audio files. How do we make all of this information accessible, searchable, navigable, linkable, upvotable, etc? Great opportunity if someone does this right, imo.— Andrej Karpathy (@karpathy) September 26, 2022
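Once episodes are transcribed, making them searchable is a standard text-indexing problem, because Whisper's output carries a timestamp for each transcribed segment. A minimal sketch with hypothetical transcript data (a real system would use a proper search engine, stemming, and ranking):

```python
from collections import defaultdict

def build_index(segments):
    """Map each word to the (episode, start_time) pairs that mention it.

    `segments` mimics timestamped ASR output: each entry is an episode
    id, a start time in seconds, and the transcribed text.
    """
    index = defaultdict(list)
    for episode, start, text in segments:
        for word in set(text.lower().split()):
            index[word].append((episode, start))
    return index

segments = [  # hypothetical transcript segments
    ("ep1", 12.0, "we discussed speech recognition models"),
    ("ep2", 340.5, "open source speech tools are improving"),
]
index = build_index(segments)
print(index["speech"])  # every episode and timestamp mentioning "speech"
```

Each hit points back to a timestamp, which is what makes transcripts not just searchable but navigable and linkable.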
The Rising Tide of AI
With quality ASR now freely available, are companies that offer speech recognition quaking in their boots?
Just like machine translation, speech recognition systems such as Whisper cannot be immediately deployed commercially in all contexts. Human intervention is often the solution for closing the “last mile” — the gap between what an AI model can produce and the quality needed by businesses. (In fact, OpenAI’s Whisper paper found that an expert-in-the-loop approach outperformed both human-only and model-only transcription.)
Demand is likely to continue to grow for ASR providers that can help companies build custom speech recognition models and inject AI into their workflows, as well as for language service providers (LSPs) that provide expert-in-the-loop transcription and captioning as managed services (such as AI-Media and Verbit).
The easy availability of ever-improving ASR and NLP systems will also provide further impetus for startups to build SaaS platforms for specific use cases. Such platforms continue to attract significant startup funding, with examples including meeting solution platforms Airgram and Otter.AI, video editing tools Descript and Dubverse, automated subtitling platform XL8, and AI-dubbing platform Papercup.
So far, the release has not visibly impacted the conventional transcription industry. While shares in Australia-listed AI-Media and Toronto-listed VIQ Solutions have shown weakness over the past few weeks, the downtrend was in line with the broader market and there was no particular drawdown in the days following the Whisper announcement.