Automatic Dubbing: Amazon AI Researchers Explore Length-Controlled MT

Jeff Bezos went viral in early October 2021. The Amazon founder congratulated rival streaming service Netflix on their “impressive and inspiring” internationalization strategy in a tweet about the worldwide success of the hit Korean series, Squid Game.

Meanwhile, Amazon (home to Prime Video) has been busy working on its own internationalization efforts. A team of Amazon AI researchers recently delved deeper into automatic dubbing (AD) and machine translation (MT).

The resulting paper, “Machine Translation Verbosity Control for Automatic Dubbing,” was published on preprint platform arXiv on October 8, 2021. The authors are scientists and engineers from Amazon AI, a unit of Amazon Web Services. Among them is Marcello Federico, who, in addition to leading prior machine dubbing efforts at Amazon, is also a co-founder of translation productivity tool MateCat.

The research focuses on the “problem of controlling the verbosity of machine translation output” with the aim of generating better-quality automatic dubbing. In this context, verbosity relates to length; that is, the authors want to control MT output length for use in dubbing.

They explained: “Automatic dubbing aims at seamlessly replacing the speech in a video document with synthetic speech in a different language.” Doing so is complex because translations must both preserve the meaning of the original and match its length.

The experiments covered machine translation from English transcripts into French, Italian, German, and Spanish. The researchers controlled the number of characters in the MT output as a proxy for controlling dubbing duration.

Better Translations, Worse Dubbing?

According to the researchers, the experiments used intrinsic and extrinsic evaluations, a significant difference from previous work: “Intrinsic evaluations measure MT quality and verbosity with respect to human post-edited translations matching length requirements, while extrinsic evaluations measure subjective quality of video clips dubbed by using the generated translations.”

MT quality was measured using the BLEU score. To measure verbosity, the researchers calculated the percentage of MT outputs whose length fell within ±10% of the length of the original, a tolerance they said they “consider acceptable for AD.”
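The length check described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, and it assumes that length is the raw character count (spaces included) and that each MT output is compared against its source sentence.

```python
def within_tolerance(source: str, translation: str, tolerance: float = 0.10) -> bool:
    """Return True if the translation's character count is within
    +/- tolerance (as a fraction) of the source's character count.

    Assumption: length is the plain character count, spaces included.
    """
    src_len = len(source)
    if src_len == 0:
        # An empty source can only be matched by an empty translation.
        return len(translation) == 0
    return abs(len(translation) - src_len) <= tolerance * src_len


def length_compliance(pairs, tolerance: float = 0.10) -> float:
    """Return the percentage of (source, translation) pairs whose
    translation length falls within the tolerance band."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(within_tolerance(src, mt, tolerance) for src, mt in pairs)
    return 100.0 * hits / len(pairs)
```

Calling `length_compliance` on a list of (source, MT output) string pairs returns a percentage that plays the role of the verbosity figure reported alongside BLEU. A real evaluation might count length differently, for example by excluding punctuation.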

Meanwhile, for the subjective evaluations, the researchers generated dubbed videos in Italian and German, and asked 40 subjects to rate their viewing experience.

In terms of MT quality, the researchers concluded that “our resulting best model not only produces translations much closer in length to the input, but often also better in translations” when compared to a standard Transformer MT model trained without verbosity information.

However, they said, the subjective evaluation of automatically dubbed videos, which used the MT-generated translations, both with and without verbosity control, confirmed an “increase in human preference for videos dubbed with the latter version” (i.e., without verbosity control).

Automating the Work of Dubbing Studios

The paper noted, “AD tries to automate the localization of audiovisual content, a complex and demanding workflow managed during post-production by dubbing studios.” Dubbing workflows are indeed complex, involving multiple human stages and creative collaboration.

In current professional workflows, translators or post-editors are responsible for ensuring that the original meaning is preserved, while adapters and voice artists typically take care of length-matching and lip-synching.

Although professional dubbing is normally synonymous with lip-sync dubbing (with synchronized lip movements), the research “only” aimed to achieve synchronization at the utterance level and did not concern itself with lip or body movement synchrony. This is the case with most work on automatic dubbing, the researchers said.

Not only are professional dubbing workflows as yet unrivalled by current automatic dubbing, but many dubbing studios are also seeing increased demand. As streaming services such as Netflix and Amazon continue to localize their English-language content to drive subscriptions worldwide, many are now also doubling down on bringing international content to English-speaking audiences.