Large Amazon Study on Human Dubbing Has ‘Surprising’ Implications for AI Dubbing

Examining “every Amazon-produced TV show available on Prime Video at the end of 2021 for which, crucially, there was a hand-curated transcript (English shows) or dubbing script (dubbed shows),” Amazon scientists came to this conclusion:

“Human dubbers display less respect for isochrony and especially lip sync than is suggested by qualitative literature, while being surprisingly unwilling to vary speaking rates or sacrifice translation quality to hit other constraints,” most of which have to do with closely matching the original video track.

In a paper published on December 23, 2022, William Brannon, Yogesh Virkar, and Brian Thompson detail how they investigated a dataset comprising 319.57 hours of content with 9,215 distinct speakers, taken from 674 episodes of 54 shows. (Now at the MIT Media Lab, Brannon interned at Amazon Web Services in 2022, while Virkar and Thompson currently work at AWS AI Labs as Applied Scientists.)

All shows were originally recorded in English. Where available, the authors acquired audio and video for the English originals and audio tracks for the Spanish and German dubs. “Much of our analysis relies on a subset of 35.68 hours of content with both Spanish and German dubs,” they said.

The authors highlight how the results of the study “challenge a number of assumptions commonly made in both qualitative literature on human dubbing and machine-learning literature on automatic dubbing.”

Furthermore, the results suggest two shifts in emphasis: first, prioritizing vocal naturalness and translation quality over the commonly emphasized isometric (character-length) and lip-sync constraints; and second, taking a more qualified view of the significance of isochronic (timing) constraints.

According to the authors, source-side audio has a substantial influence on human dubs through channels other than the translated words. This indicates the need for research into automatic dubbing (aka machine dubbing or AI dubbing) systems; most notably, they said, research into how to preserve speech characteristics and transfer semantic cues such as emphasis or emotion.

Product, Not Process

The Amazon scientists examined human dubbing not by studying its process, but its product: a large set of actual dubbed dialogues from TV shows. They noted that, compared to interviews with dubbers, their approach had “the particular virtue of capturing tacit knowledge brought to bear in the human dubbing process but difficult to write down or explain.”

They were especially curious about how human dubbers balanced several competing interests: semantic fidelity, natural speech, timing constraints, and (convincing) lip sync. The following factors were considered, with an illustrative sketch of the first two after the list:

  • Isochrony – Do dubbers respect timing constraints imposed by the video and original audio?
  • Isometry – Do the original and dub texts have approximately the same number of characters?
  • Speech tempo – How much do voice actors vary their speaking rates, possibly compromising speech naturalness, to meet timing constraints?
  • Lip sync – How closely do the voice actors’ words match visible mouth movements of the original actors?
  • Translation quality – How much will dubbers reduce translation accuracy (i.e., adequacy and fluency) to meet other constraints?
  • Source influence – Do source speech traits influence the target in ways not mediated by the words of the dub, indicating emotion transfer?
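
To make the first two measures concrete, here is a minimal sketch of how they might be computed for a pair of time-aligned dialogue segments. This is not the authors’ code: the `Segment` structure, the helper functions, and the example lines are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """A hypothetical time-aligned dialogue line (not the paper's data format)."""
    text: str     # dialogue text
    start: float  # onset in seconds
    end: float    # offset in seconds

def isometry_ratio(source: Segment, dub: Segment) -> float:
    """Character-length ratio of dub to source; 1.0 = perfectly isometric."""
    return len(dub.text) / max(len(source.text), 1)

def isochrony_offsets(source: Segment, dub: Segment) -> tuple[float, float]:
    """Absolute onset/offset deviations in seconds; (0, 0) = perfectly isochronic."""
    return abs(dub.start - source.start), abs(dub.end - source.end)

# Invented example: a German dub that hits the timing but not the character count.
src = Segment("How are you doing today?", start=12.0, end=13.6)
dub = Segment("Wie geht es dir denn heute?", start=12.1, end=13.6)

on, off = isochrony_offsets(src, dub)
print(f"isometry ratio: {isometry_ratio(src, dub):.2f}")  # 1.12
print(f"isochrony offsets: {on:.2f}s, {off:.2f}s")        # 0.10s, 0.00s
```

A line like this would count as well dubbed on the isochrony measure while missing the isometric target, which is exactly the kind of divergence the study quantifies at scale.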

The study focused on two language pairs (EN-DE; EN-ES). In future work, the authors hope to analyze more distant language pairs, such as English–Chinese or English–Arabic, as well as non-English source material.

The authors pointed out, “Our analysis has shown that isometry is a poor proxy for isochrony in human dubs, yet several prior works have claimed that isometric MT benefits automatic dubbing. In future work, we hope to perform analysis to understand this discrepancy.”
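
The proxy failure the authors describe is easy to illustrate with the hypothetical `Segment` helpers from the sketch above (again, invented lines, not data from the study): two texts of identical character length can occupy very different amounts of speaking time.

```python
# Continuing the sketch above: equal character counts, mismatched timing.
src = Segment("I can't believe it.", start=5.0, end=6.2)
dub = Segment("Unglaublich, oder?!", start=5.0, end=7.4)  # 19 characters each

on, off = isochrony_offsets(src, dub)
print(f"isometry ratio: {isometry_ratio(src, dub):.2f}")  # 1.00 -> "isometric"
print(f"isochrony offsets: {on:.2f}s, {off:.2f}s")        # 0.00s, 1.20s -> not isochronic
```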