Should I Use Whisper or Amazon Transcribe?


With the proliferation of specialized and multipurpose large language models (LLMs), the question of which service or tool is best suited to which task, and when, has only become more complex. This certainly applies to the wave of “helpers” emerging for automatic speech recognition (ASR) and its downstream uses.

A simple pro/con list might not do justice to the capabilities of OpenAI’s Whisper or one of its main competitors, Amazon Transcribe — but it may serve as a decent starting point as users weigh their options. Here are some points to consider when deciding between the two.

Language

The first question to settle, even before broaching slang and other specifics: Which language (or languages) need to be identified, transcribed, and/or translated?

Amazon Transcribe, which launched its foundation model-powered system in November 2023, currently supports more than 100 spoken languages (up from 39 earlier in 2023). The company notably used “smart data sampling” to level the playing field for historically underrepresented languages. Ideally, this technique will boost ASR quality for low-resource languages, which tend to lag behind others under traditional methods because of a lack of data.

Moreover, the speech foundation model has reportedly helped Amazon Transcribe improve accuracy by 20% to 50% in “most languages.”
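
For readers who want to test that language coverage themselves, here is a minimal sketch of how a transcription request with automatic language identification might be assembled for Amazon Transcribe through the boto3 SDK. The helper names (build_job_request, submit_job), the job name, and the S3 location are hypothetical; the sketch assumes boto3 is installed and AWS credentials are configured before anything is actually submitted.

```python
from typing import Any, Dict, Optional


def build_job_request(job_name: str, media_uri: str,
                      language_code: Optional[str] = None) -> Dict[str, Any]:
    """Assemble keyword arguments for Transcribe's start_transcription_job.

    If no language code is given, ask the service to identify the
    dominant language itself rather than assuming one.
    """
    request: Dict[str, Any] = {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
    }
    if language_code is None:
        request["IdentifyLanguage"] = True
    else:
        request["LanguageCode"] = language_code
    return request


def submit_job(request: Dict[str, Any]) -> Dict[str, Any]:
    """Send the request; needs boto3 and AWS credentials (not run here)."""
    import boto3  # imported lazily so the builder works without the SDK

    client = boto3.client("transcribe")
    return client.start_transcription_job(**request)


if __name__ == "__main__":
    # Hypothetical job name and S3 location, for illustration only.
    print(build_job_request("demo-job", "s3://example-bucket/interview.mp3"))
```

Keeping the request builder separate from the network call makes it easy to inspect what would be sent before any job is billed.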

Whisper, an open-source AI model, supported transcription of audio in 98 languages, and translation of that audio into English, as of May 2023. That is slightly fewer languages than Amazon Transcribe supports.

Even Whisper’s GitHub page acknowledges that the system’s “performance varies widely depending on the language.” 
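
Whisper's two tasks, transcription in the source language and translation into English, can be tried locally. The following is a rough sketch, assuming the open-source openai-whisper Python package (and ffmpeg) is installed; the helper names whisper_options and run_whisper are hypothetical, and "large-v3" is just one of the available model sizes.

```python
from typing import Any, Dict, Optional


def whisper_options(source_language: Optional[str] = None,
                    to_english: bool = False) -> Dict[str, Any]:
    """Build decoding options for Whisper's transcribe() call."""
    options: Dict[str, Any] = {
        "task": "translate" if to_english else "transcribe",
    }
    if source_language is not None:
        # A language code such as "de"; leave unset to let Whisper detect it.
        options["language"] = source_language
    return options


def run_whisper(audio_path: str, model_name: str = "large-v3",
                **options: Any) -> str:
    """Load a model and return the recognized text (downloads weights)."""
    import whisper  # imported lazily; provided by the openai-whisper package

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, **options)
    return result["text"]


if __name__ == "__main__":
    # Options for translating German speech into English text.
    print(whisper_options(source_language="de", to_english=True))
```

Because quality varies by language, running the same clip through both tasks is a quick way to judge whether Whisper is good enough for a given language.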

Verdict: Use Whisper for the languages in which it consistently performs well. Amazon Transcribe will likely be the better choice for low-resource languages.

Content

Any given tool’s suitability for a given task often comes down to content. Will the tool be able to handle specialized terminology, or demonstrate an understanding of implied context? Oftentimes, it depends on the training data.

Amazon Transcribe and Whisper seem to have been trained on data at a similar scale: Amazon Transcribe was trained on millions of hours of audio data from more than 100 languages. While Whisper started out with just 680,000 hours of supervised audio from the web, its latest iteration, Whisper-v3, was trained on five million total hours of audio.

More specifically, Whisper’s training data can be further broken down into one million hours of “weakly labeled audio” and four million hours of “pseudolabeled audio.” Amazon Transcribe’s data, meanwhile, was all unlabeled, though Amazon does offer more specialized versions, such as Amazon Transcribe Medical, for specific use cases.

Verdict: Whisper’s (at least partly) labeled training data implies more human involvement at that step, which could be helpful for more specialized content. Amazon Transcribe’s unlabeled training data might be more suitable for general content.

End Goal

Ultimately, most users (as opposed to researchers and developers) are focused on what they can get out of a tool — so it makes sense to begin with the end in mind. In other words, users should select their tool based on what they want to accomplish. 

Amazon Transcribe, first introduced in 2017, would appear to be the more “established” of the two, though its foundation model was only launched in November 2023. But as far back as 2019, Amazon was already encouraging end users to use Amazon Transcribe for voice translation apps.

Amazon currently promotes Amazon Transcribe for automatically generating captions and subtitles, reportedly with improved readability via enhanced capitalization and punctuation.  
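
As an illustration of the captioning use case, this hedged sketch shows how subtitle output might be requested alongside a transcript using the Subtitles parameter of start_transcription_job in boto3. The function name subtitle_job_request and all argument values are hypothetical, and the request is only built here, not sent.

```python
from typing import Any, Dict, Sequence

# Subtitle file formats accepted by the Subtitles parameter.
ALLOWED_FORMATS = {"srt", "vtt"}


def subtitle_job_request(job_name: str, media_uri: str, language_code: str,
                         formats: Sequence[str] = ("srt", "vtt")) -> Dict[str, Any]:
    """Build start_transcription_job arguments that also request subtitles."""
    unknown = set(formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported subtitle formats: {sorted(unknown)}")
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "Media": {"MediaFileUri": media_uri},
        "Subtitles": {"Formats": list(formats)},
    }


if __name__ == "__main__":
    # Hypothetical job name and S3 location, for illustration only.
    print(subtitle_job_request("captions-demo",
                               "s3://example-bucket/webinar.mp4", "en-US"))
```

The resulting dictionary could be passed as keyword arguments to a boto3 Transcribe client once credentials are in place.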

The comparatively new Whisper saw its own v3 released in November 2023, though not without some kinks. Despite critiques on GitHub, ranging from inconsistent improvement across languages to hallucinations and repetitions, end users are discussing possible new use cases, such as air traffic control and real-time transcription for streaming audio or video.

Verdict: A draw, considering the very similar offerings; individual results may vary.