AWS Launches New Transcribe Foundation Model Trained on Millions of Hours of Audio

Amazon’s transcription tool, Amazon Transcribe, now supports more than 100 languages, thanks to a new foundation model-powered system for automatic speech recognition (ASR).

In a November 26, 2023 blog post, Amazon announced that its team trained the speech foundation model — an AI system, such as a large language model, that can be trained for specific tasks — using “best-in-class, self-supervised algorithms.” 

The system, which was trained on millions of hours of unlabeled audio data from more than 100 languages, uses “smart data sampling” to balance the proportions of training data across languages; the purpose is to achieve high accuracy for historically low-resource languages. 
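Amazon has not published the details of its "smart data sampling" method. A common technique for balancing multilingual training data, however, is temperature- or exponent-based sampling, which flattens the raw language distribution so low-resource languages are seen more often during training. The sketch below illustrates that general idea only, not Amazon's method, and the corpus sizes are purely hypothetical.

```python
import numpy as np

def balanced_sampling_probs(hours_per_language: dict, alpha: float = 0.5) -> dict:
    """Flatten a skewed data distribution: sample language i with
    probability proportional to p_i ** alpha. With alpha < 1,
    low-resource languages are upweighted relative to their raw
    share of the corpus; alpha = 1 keeps the natural proportions."""
    hours = np.array(list(hours_per_language.values()), dtype=float)
    p = hours / hours.sum()   # raw proportions of training audio
    q = p ** alpha            # flatten the distribution
    q /= q.sum()              # renormalize to probabilities
    return dict(zip(hours_per_language, q))

# Hypothetical corpus sizes, in hours of unlabeled audio.
corpus = {"en": 1_000_000, "de": 120_000, "ta": 4_000, "gsw": 800}
print(balanced_sampling_probs(corpus, alpha=0.5))
```

High-resource languages still dominate, but a language like Swiss German receives a far larger share of training batches than its raw hours alone would give it.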

According to the blog post, the speech foundation model has improved the accuracy of the pay-as-you-go Amazon Transcribe service by 20-50% across most languages.

Amazon primarily markets its ASR service as a way for users to automatically create captions and subtitles for their content, but it also offers more specialized versions for specific industries, including Amazon Transcribe Medical — a possible competitor in a space where dictation providers have recently struggled with data privacy.
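To give a sense of how the service is used in practice: transcription jobs run asynchronously against audio stored in Amazon S3, invoked through the AWS SDK. Below is a minimal sketch using boto3, Amazon's Python SDK; the bucket, file, and job names are placeholders.

```python
import boto3

transcribe = boto3.client("transcribe", region_name="us-east-1")

# Start an asynchronous transcription job against audio stored in S3.
transcribe.start_transcription_job(
    TranscriptionJobName="example-job",
    Media={"MediaFileUri": "s3://example-bucket/interview.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
)

# Check the job status; once it reads COMPLETED, the response
# includes a URI pointing to the finished transcript.
job = transcribe.get_transcription_job(TranscriptionJobName="example-job")
print(job["TranscriptionJob"]["TranscriptionJobStatus"])
```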

A major speech-to-text (STT) competitor to Amazon Transcribe, meanwhile, is OpenAI's Whisper, an ASR model for transcription and translation into English, introduced in September 2022. Workflow automation tool Zapier introduced a Whisper API in April 2023, connecting Whisper to the low- and no-code tech ecosystem.
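Unlike Transcribe's managed cloud jobs, Whisper can run locally via its open-source Python package. A minimal sketch of its two tasks (the audio file names are placeholders):

```python
import whisper  # the openai-whisper package

model = whisper.load_model("base")

# Transcribe speech in its source language...
result = model.transcribe("interview.mp3")
print(result["text"])

# ...or translate non-English speech directly into English,
# the second task the model was trained for.
result = model.transcribe("interview_de.mp3", task="translate")
print(result["text"])
```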

First released in November 2017, Amazon Transcribe added support for custom vocabularies in April 2018. Over the years, the service has grown to handle dozens of languages, from accented English (starting in November 2018) to later (2019) additions including Tamil, Gulf Arabic, and Swiss German.

In September 2021, the service began to generate subtitles for video files, and in May 2022, Amazon unveiled batch language identification (that is, identifying more than one language in a single audio file). 
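Both features are exposed through the same batch job API. A hedged sketch, again with placeholder names, showing subtitle output and multi-language identification together:

```python
import boto3

transcribe = boto3.client("transcribe", region_name="us-east-1")

transcribe.start_transcription_job(
    TranscriptionJobName="example-subtitle-job",
    Media={"MediaFileUri": "s3://example-bucket/panel_discussion.mp4"},
    MediaFormat="mp4",
    # Batch language identification: detect multiple languages within
    # one file instead of passing a single fixed LanguageCode.
    IdentifyMultipleLanguages=True,
    # Emit subtitle files alongside the transcript.
    Subtitles={"Formats": ["srt", "vtt"]},
)
```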

Speaking to Slator in January 2023, Happy Scribe CEO André Bastié shared that his company relies on a mix of DeepL and Google Translate to provide translation — tools that Amazon, presumably, might not need, given its massive stores of proprietary data and its own Amazon Translate tool. Data, however, is not the only issue, as Bastié pointed out. 

“You need a very deep language understanding to be able to do subtitles. You need to understand the structure of a sentence to know where to break it,” he explained. “Doing subtitles is easy. Doing subtitles that are readable is tough.”

To that end, the update also reportedly improves readability through "more accurate punctuation and capitalization," and expands support for a range of accents, noise environments, and acoustic conditions.

Despite these new capabilities, Tony Abrahams, CEO of live multilingual captioning provider Ai-Media, told Slator in April 2023 that ASR tools have certain limits that require humans to step in for especially complicated audio — perhaps 10% of a client’s audio.

“I think that percentage will shrink. But the reality of it is that while that percentage of content might only be 10%, it tends to be the most important content for our customers,” Abrahams said.