Here Is How Amazon Wants to Improve Machine Dubbing

In December 2022, a group of Amazon researchers published the results of their efforts to study human dubbing, a project for which they used a dataset comprising samples from 54 shows in three languages. The main finding of that study was that human dubbers prioritize translation quality over timing.

What about machine dubbing (aka AI dubbing or automatic dubbing)? It turns out that it, too, struggles to match speech timing between source and target. And while machine translation (MT) engines can be trained on massive amounts of human output thanks to the ubiquity of publicly available text, there is still little in the public domain that can be used to train automatic dubbing systems.

On February 25, 2023, a group of Amazon researchers (Alexandra Chronopoulou, Brian Thompson, Prashant Mathur, Yogesh Virkar, Surafel M. Lakew, and Marcello Federico at AWS) published a paper acknowledging this limitation and proposing a new way to train automatic dubbing systems for better speech alignment and translation quality. Side note: Federico was the CEO and co-founder of ModernMT, a machine translation company now part of Translated.

To date, the typical way to achieve even an approximate timing match between source and automatically translated speech has been a combination of two approaches: the text translation is synthesized into speech of a duration similar to the source speech, and the generated speech is then manipulated, compressed or expanded and padded with pauses, to fit.
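
The paper does not include code for this baseline pipeline, but the second, signal-level step is easy to illustrate. The sketch below is a rough illustration only; the function name, the clamping threshold, and the numbers are made up, not taken from any production dubbing system.

```python
# Minimal sketch of the classic post-hoc timing fix: time-stretch the
# synthesized speech toward the source duration, then pad any leftover
# time with silence. All names and thresholds here are illustrative.

def fit_speech_to_source(tts_duration_s: float, source_duration_s: float,
                         max_stretch: float = 1.25) -> tuple[float, float]:
    """Return (playback_rate, padding_s) so TTS output fills the source slot."""
    rate = tts_duration_s / source_duration_s
    # Clamp the tempo change so the voice still sounds natural.
    rate = min(max(rate, 1 / max_stretch), max_stretch)
    adjusted_duration = tts_duration_s / rate
    # Any remaining gap is filled with a pause rather than more stretching.
    padding = max(source_duration_s - adjusted_duration, 0.0)
    return rate, padding

rate, pad = fit_speech_to_source(tts_duration_s=3.8, source_duration_s=3.2)
print(f"play at {rate:.2f}x, then pad {pad:.2f}s of silence")
```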

In the paper, the researchers set out to optimize speech alignment/timing and MT quality at the same time. To that end, they designed a flexible model that adjusts the translation to align with the speech duration and, the other way around, adjusts the speech duration to fit the translation.

MT Made Just for Dubbing?

AWS researchers propose creating a special MT model for automatic dubbing, a sort of “duration-aware” model.

In such a model, the predicted durations of the translated speech are expected to add up to the length of the source speech, so that source and target speech match. The researchers claim their approach is different from previous research because the predicted durations in their experiments are for individual phonemes, not full words.
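
To make the phoneme-level idea concrete, here is a toy illustration, not the paper's actual output format, of what duration-aware output could look like: each target phoneme (and explicit pause token) carries its own predicted duration.

```python
# Toy representation of duration-tagged phoneme output. The phoneme
# symbols and frame counts below are invented for illustration.

from dataclasses import dataclass

@dataclass
class PhonemeUnit:
    phoneme: str   # e.g., "HH", "AH", or the explicit token "<pause>"
    frames: int    # predicted duration in acoustic frames

# A short utterance rendered as duration-tagged phonemes (values made up):
target = [
    PhonemeUnit("HH", 4), PhonemeUnit("AH", 6), PhonemeUnit("L", 3),
    PhonemeUnit("OW", 8), PhonemeUnit("<pause>", 12),
    PhonemeUnit("V", 4), PhonemeUnit("EH", 5), PhonemeUnit("L", 3),
    PhonemeUnit("T", 4),
]

# Summing per-phoneme durations gives the total target speech length,
# which is what the model tries to match to the source speech.
total_frames = sum(u.frames for u in target)
print(f"predicted target duration: {total_frames} frames")
```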

Given the scarcity of dubbing data, the model was trained on speech translation data: source texts as input, and target speech transcripts, again as phonemes rather than words and including pauses, as output. Phoneme durations were obtained by forced alignment of the target speech with its transcript; at inference time, the desired timing comes from the source speech, and the model is prompted to override natural durations to match it.
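
Below is a hedged sketch of what one training example might look like under this scheme. The field names, the duration-prompt format, and the phoneme inventory are all illustrative assumptions, not taken from the paper.

```python
# Assemble one (hypothetical) training example: input is the source text
# prefixed with the total duration to fill; output is the target
# transcript as phonemes whose durations come from forced alignment of
# the target audio. At inference, the source speech duration would be
# substituted into the prompt instead.

def build_example(source_text, target_phonemes, aligned_durations_s,
                  frame_s=0.02):
    assert len(target_phonemes) == len(aligned_durations_s)
    total = sum(aligned_durations_s)
    return {
        "input": f"<duration:{total:.2f}s> {source_text}",
        # Durations are quantized to acoustic frames (20 ms here).
        "output": [(p, round(d / frame_s))
                   for p, d in zip(target_phonemes, aligned_durations_s)],
    }

ex = build_example(
    source_text="Hello world",
    target_phonemes=["h", "a", "l", "o", "<pause>", "v", "ɛ", "l", "t"],
    aligned_durations_s=[0.08, 0.12, 0.06, 0.16, 0.24, 0.08, 0.10, 0.06, 0.08],
)
print(ex["input"])
print(ex["output"][:3])
```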

Experimenting with German and English

German has notoriously long words, so using phonemes as the basis for alignment with English, regardless of direction, makes a lot of sense. The researchers used the English-German pair data available in a training set called CoVoST 2. The data included English speech and the corresponding transcripts, the German text translations, and the same elements in the opposite direction (German as source and English as target).
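
The paper does not describe its exact data preparation, but for readers who want to experiment, one common way to access CoVoST 2 is through the Hugging Face datasets library, as sketched below. Note that the loader expects the corresponding Common Voice audio to be downloaded separately; the path shown is a placeholder.

```python
# Sketch of loading the English->German CoVoST 2 split with Hugging Face
# `datasets`. This is one common access path, not necessarily how the
# Amazon team prepared their data.

from datasets import load_dataset

covost = load_dataset(
    "covost2",
    "en_de",                                  # English speech -> German text
    data_dir="/path/to/common_voice_en",      # placeholder: local CV audio
)

sample = covost["train"][0]
print(sample["sentence"])      # English transcript
print(sample["translation"])   # German translation
# sample["audio"] holds the source speech waveform and sampling rate.
```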

From random samples, the researchers built two test subsets, one with 91 sentences (test91) and a second with 101 sentences (test101), with no overlap between the two. In the test101 subset, each German sentence was annotated with at least one natural (native human) pause.

German-speaking volunteers then recorded a clip for each sentence in the two subsets, following precise but different instructions for each set as to where and when to pause. Humans also took part in evaluating the two subsets.

The main metric used to assess translation quality was BLEU. Timing was measured using a metric aptly called “Speech Overlap.” 
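
The paper's precise definition of speech overlap is not reproduced in this article, but a plausible formulation, shown purely for intuition, scores how much the dubbed speech intervals coincide in time with the original ones:

```python
# One plausible (assumed) formulation of a speech-overlap score: total
# temporal intersection of source and dubbed speech segments, normalized
# by the longer of the two total speech durations.

def speech_overlap(src_segments, dub_segments):
    """Segments are (start_s, end_s) tuples; returns a score in [0, 1]."""
    def total(segs):
        return sum(end - start for start, end in segs)
    intersection = 0.0
    for s1, e1 in src_segments:
        for s2, e2 in dub_segments:
            intersection += max(0.0, min(e1, e2) - max(s1, s2))
    return intersection / max(total(src_segments), total(dub_segments))

src = [(0.0, 2.0), (3.0, 5.0)]   # original speech, with a pause between
dub = [(0.1, 2.2), (3.1, 5.0)]   # dubbed speech
print(f"speech overlap: {speech_overlap(src, dub):.2f}")
```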

Allowing the generated speech to deviate from the source speech duration improved translation quality, at the cost of somewhat reduced speech overlap.

“We expect that the best dubbing speech is not necessarily obtained from the best stand-alone translation; instead, the generated output needs to both be similar in terms of content with the source but also match the pauses and prosodic structure of the source speech,” stated the researchers in the paper.

The researchers made the resulting dubbing test set available for further experimentation.