January 24, 2020
Machine Dubbing: Amazon AI Opens New Chapter in Automating Media Localization
Cloud dubbing is so 2019. Machine translation researchers are exploring a new frontier in natural language processing: machine dubbing. A January 2020 research paper by a team at Amazon AI, a unit of Amazon Web Services (AWS), explored new techniques to make automatic dubbing appear more natural.
The team was led by Marcello Federico, whose résumé includes being a co-founder of translation productivity tool MateCat and contributions to Translated.net’s ModernMT (MMT), an open source machine translation (MT) software designed for the translation industry. In September 2018, Federico joined AWS as Principal Applied Scientist.
As the paper pointed out, the demand for automatic dubbing could be huge, as streaming platforms — including Netflix, Asian brand HOOQ, and Amazon’s own Prime Video — are producing and making available online more content than ever before.
Traditional dubbing, however, is expensive. At a localization roundtable organized by media industry business network DPP, one participant highlighted that “one of the top programming costs internationally is localisation. The cost of voice dubbing in particular can be very significant. Particular talent in particular markets can be the key to achieving engagement.”
Blazing the Trail
Automatic dubbing, which can be seen as an extension of speech-to-speech translation (STST), aims to replace all speech in a video with speech in a different language, while maintaining as natural a look and sound as possible. STST has emerged as a new focus of the machine translation research community, with the likes of Google launching a system called Translatotron and Microsoft showing off what it claimed was direct STST at its Microsoft Inspire 2019 conference.
The researchers believe that theirs “is the first work on automatic dubbing that integrates enhanced deep learning models” for machine translation (MT), text-to-speech (TTS), and audio rendering, evaluating them on real-world videos.
The researchers selected 24 video clips from six TED talks in English. Each clip lasted around 10 to 15 seconds, featured a single speaker saying at least two sentences, and showed the speaker’s face for most of its duration.
The team manipulated their MT model to generate Italian sentences of a desired length based on timing restrictions. (Of course, the team used Amazon Translate as its baseline MT system.)
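One crude way to illustrate length-aware translation — not the paper’s actual decoder modification, which biases generation directly — is to rescore a list of candidate translations against a character budget derived from the source utterance’s timing. The candidate sentences and the budget below are hypothetical:

```python
def pick_length_matched(candidates, target_chars):
    """Return the candidate translation whose character count is
    closest to the target length implied by the source timing.
    A stand-in for true length-controlled decoding."""
    return min(candidates, key=lambda c: abs(len(c) - target_chars))

# Hypothetical n-best list of Italian translations for one utterance
candidates = [
    "Grazie a tutti per essere venuti qui oggi",
    "Grazie per essere qui",
    "Vi ringrazio tutti di cuore per la vostra presenza qui oggi",
]

# Suppose the English utterance's duration allows roughly 25 characters
best = pick_length_matched(candidates, target_chars=25)
```

Rescoring an n-best list is a weaker lever than shaping the decoder itself, but it conveys the core constraint: the translation must fit the time slot it will be spoken in.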
Utterance by utterance, the team ensured that the duration of the Italian TTS output matched that of the original English audio. They also performed “audio rendering,” extracting background audio (everything except for speech) from the original videos and then adding it back into the dubbed versions to make them sound more natural.
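The duration-matching step can be sketched as a uniform time stretch of the synthesized waveform. The naive linear-interpolation resampling below shifts pitch as a side effect, which production systems (and the paper’s per-phrase prosodic alignment) avoid; it is only a minimal illustration of forcing the TTS output to occupy the source utterance’s time slot:

```python
import numpy as np

def stretch_to_duration(samples, target_duration_s, sample_rate=16000):
    """Uniformly resample a mono waveform so it lasts target_duration_s
    seconds. Naive: changes pitch along with speed, unlike real
    prosodic alignment."""
    target_len = int(round(target_duration_s * sample_rate))
    src_positions = np.linspace(0, len(samples) - 1, num=target_len)
    return np.interp(src_positions, np.arange(len(samples)), samples)

# Toy example: a 1-second 440 Hz tone stretched to fill 1.5 seconds,
# as if the Italian TTS output were shorter than the English original
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
stretched = stretch_to_duration(tone, target_duration_s=1.5, sample_rate=sr)
```

The audio-rendering step is then, at its simplest, summing the stretched dubbed speech with the extracted background track sample by sample.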
According to ratings of three versions of the video clips by 14 volunteers (five Italian speakers and nine non-Italian speakers), comprehension of content does impact a listener’s perception of automatic dubbing quality. Italian speakers tended to rank system A (which featured only STST) as the best version, whereas non-Italian speakers found system A to be the worst.
Too Slow, Too Fast, or Too Uneven
“The comments left by the Italian listeners [indicate] that the main problem of system B [STST plus enhanced MT and prosodic alignment] is the unnaturalness of the speaking rate, i.e. it is either too slow, too fast, or too uneven,” the researchers wrote.
By contrast, the non-Italian speakers showed a statistically significant preference for system C [system B plus audio rendering], which the researchers interpret as demonstrating “the relevance of including para-linguistic aspects (i.e. applause, audience laughs in jokes, etc.) and acoustic conditions (i.e. reverberation, ambient noise, etc.).”
Based on the feedback, the researchers expect that their future work will include “computing better segmentation and introducing more flexible lip synchronization.”