According to its creator, OpenAI, the automatic speech recognition (ASR) system Whisper approaches “human-level robustness and accuracy” for English speech recognition. The ASR, or speech-to-text (STT), system was released as open source on September 21, 2022; the Whisper API followed in March 2023.
The open-source model was trained on 680,000 hours of multilingual and multitask supervised data collected from the web. Whisper benefited from the diversity and scale of this dataset despite its imperfect transcription quality; only limited effort was made to manually correct the biggest issues.
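Because the model weights are open source, Whisper can be run locally via the `openai-whisper` Python package. The sketch below is illustrative, not authoritative: the helper function name and the choice of the `"base"` model size are assumptions, though `whisper.load_model` and `model.transcribe` (with its `task` parameter) are the package’s documented entry points.

```python
def transcribe_audio(audio_path: str, translate: bool = False) -> str:
    """Run the open-source Whisper model on an audio file.

    Requires `pip install openai-whisper`; model weights are
    downloaded on first use. Helper name and model size ("base")
    are illustrative choices, not part of Whisper itself.
    """
    import whisper  # imported lazily so the module loads without the package

    model = whisper.load_model("base")
    # task="translate" yields English text directly from non-English audio;
    # task="transcribe" keeps the source language.
    result = model.transcribe(
        audio_path, task="translate" if translate else "transcribe"
    )
    return result["text"]
```

Larger model sizes (e.g., `"medium"`, `"large"`) trade speed for accuracy; the same `transcribe` call handles both transcription and into-English translation, which is the “single-shot” behavior described below.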
Whisper is capable of multilingual transcription and single-shot into-English translation without an intermediary step. The API gives developers access to cutting-edge transcription and translation in 50 languages. Although the model was originally trained on 98 languages, only those with a word error rate (WER) – the industry-standard benchmark – of under 50% are included.
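WER, the metric behind that 50% cutoff, is the word-level edit distance (substitutions + deletions + insertions) between a system’s output and a reference transcript, divided by the number of reference words. A minimal sketch of the standard computation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One word dropped from a six-word reference -> WER of 1/6
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 0.5 thus means roughly one error for every two reference words, which is why languages above that threshold were excluded from the API.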
According to OpenAI’s research paper, the trade-off between quality and quantity has paid off. When presented with broad and diverse data, Whisper makes an average of 55% fewer errors than other fine-tuned ASR models. The system also performs robustly with multiple and varied speakers, accents, background noise, and technical terminology.
However, OpenAI states that Whisper’s performance on a clean benchmark dataset was unremarkable. The system does not outperform models specializing in LibriSpeech performance – a competitive benchmark in speech recognition – because Whisper is not fine-tuned to any specific dataset.
Additionally, performance varies by language. As is common, high-resource languages fare better than low-resource languages. Researchers also note that Whisper, like many large language models, experiences hallucinations. Some suggest this is unimportant and simply par for the course in this sphere.
ASR systems like Whisper cannot be deployed out of the box for every use case; human intervention is still required. OpenAI’s paper concluded that, for best-quality results, an expert-in-the-loop approach is preferable.
Whisper Up Against It
The following are just a handful of the alternatives and how they compare to Whisper. It should be noted that some of these comparisons are not objective, as the research was conducted by competitors.
Slator’s analysis of Whisper and Descript found that for a difficult video of a conversation with colloquial language and background noise, Whisper generated more creative guesses for unrecognizable words; Descript tended to simply omit unknown words.
In November 2022, captions.ai compared the transcription accuracy of Whisper and Google’s flagship STT API. They declared Whisper the winner on accuracy, with an almost perfect transcription of Eminem’s “Godzilla” – the world-record holder for the fastest rap in a No. 1 single. By comparison, Google’s STT API did not come close to transcribing it. Whisper also performed better with rapid speech and accents across several English locales.
On March 6, 2023, Google launched its Universal Speech Model (USM) with state-of-the-art multilingual ASR in over 100 languages and automatic speech translation (AST) capabilities for various datasets in multiple domains. Google found that USM achieved a lower WER than Whisper.
Two months later, on May 22, 2023, Meta unveiled its Massively Multilingual Speech (MMS) project. Meta suggests these models “outperform existing models and cover 10 times as many languages,” with labeled data for more than 1,100 languages and unlabeled data for 4,000 languages, including some with only a few hundred native speakers. According to Meta AI, systems trained on MMS data had half the WER of Whisper. However, a direct comparison with Whisper on more widely used languages is still required.
Finally, Deepgram produced a whitepaper benchmark report comparing its own capabilities with Whisper. Deepgram highlights the distinct variation in WER between “easy” audio and more challenging real-world audio. They conclude that “Deepgram offers higher accuracy, richer features, lower operating costs, faster processing speeds” and more. Indeed, Deepgram’s homepage states, “Innovators are switching from Whisper’s speech-to-text API to Deepgram to enable the future of intelligent voice applications”, although it is unknown how this is being measured.
Whisper in Action
Whisper has quickly appeared across the language landscape. Slator’s 2023 Language Industry Market Report demonstrates the expanding use cases for STT technology, including accessibility, engagement, and business analytics.
Captioning giant AI Media has made Whisper available via its cloud service; Happy Scribe’s transcription service is now based on a fine-tuned Whisper; and workflow automation tool Zapier now provides a Whisper connector.
Ramsri Goutham Golla told SlatorPod about his latest project, Supertranslate, a one-click subtitle app powered by Whisper, offering into-English subtitles without an intermediary translation engine.
Whisper’s significance is not limited to performance. Through Whisper, OpenAI has validated a new approach to building speech recognition models, and Whisper being open source unlocks huge volumes of audio and video content. OpenAI researchers have emphasized Whisper’s future potential; for example, others could improve accessibility tools by building additional apps that “allow for near-real-time speech recognition and translation.”