November 10, 2020
DeepMind Says Impact of New End-to-End Machine Dubbing Tech May Be Widespread
As the world increasingly relies on video for entertainment, communication, and education, a collaboration between two tech companies has set its sights on making educational videos accessible to a wider audience.
Google and DeepMind’s system for dubbing educational videos was detailed in the research paper, “Large-scale multilingual audiovisual dubbing,” which was published on pre-print platform arXiv on November 6, 2020.
Co-authors are Yi Yang, Brendan Shillingford, Yannis Assael, Miaosen Wang, Wendi Liu, Yutian Chen, Eren Sezener, Luis C. Cobo, Misha Denil, Yusuf Aytar, and Nando de Freitas of AI company and research lab DeepMind — which was acquired by Google in 2014 — and Yu Zhang from Google.
The team has an impressive track record, which spans institutions such as Oxford University, Cambridge University, Stanford University, and others. Their previous employers include Baidu, Microsoft, KAYAK, Amazon, and Google, to name a few.
What’s novel about the DeepMind / Google project is that the researchers are not only concerned with translating audio content (speech), but also focus on adapting visual content. As explained in the paper, “We extend audio-only dubbing to include a visual dubbing component that translates the lip movements of speakers to match the phonemes of the translated audio.”
This involves modifying the speaker’s on-screen facial expressions (especially the lip movements) so that they match the target language, which the researchers said “creates a more natural viewing experience in the target language.”
This end-to-end workflow is complex and involves four subsystems:
- Automatic speech recognition (ASR) – video transcription and sentence identification followed by manual correction;
- Machine translation (MT) – followed by manual correction;
- Speech synthesis with voice imitation – synthetic voicing of the translated text to sound like the speaker’s voice; and
- Lip movement synthesis – alteration of on-screen images to match the translated audio.
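As a rough illustration only, the four stages and their human checkpoints could be laid out as follows. This is a minimal sketch, not the researchers' implementation: the `Stage` class, the `PIPELINE` list, and the `plan` function are hypothetical names, and the manual-correction flags reflect only what the list above states.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    """One subsystem in the dubbing workflow (hypothetical structure)."""
    name: str
    needs_manual_correction: bool


# Order and manual-correction flags follow the list in the article;
# everything else about this layout is an assumption for illustration.
PIPELINE: List[Stage] = [
    Stage("automatic speech recognition", True),
    Stage("machine translation", True),
    Stage("speech synthesis with voice imitation", False),
    Stage("lip movement synthesis", False),
]


def plan(stages: List[Stage]) -> List[str]:
    """Expand the stages into an ordered list of automated and human steps."""
    steps: List[str] = []
    for stage in stages:
        steps.append(f"run: {stage.name}")
        if stage.needs_manual_correction:
            steps.append(f"human review: {stage.name}")
    return steps


if __name__ == "__main__":
    for step in plan(PIPELINE):
        print(step)
```

Laying the workflow out this way makes the point that two of the four stages still hand off to a human editor before the next stage can run.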
The researchers used more than 3,700 hours of transcribed video in 20 languages to feed the generic models, which were then “fine-tuned to a specific speaker before translation.”
Deep Fakes and Consent Issues
The purported goal of the research was to address the “imbalances in information access and online education” that exist because, according to the paper, nearly 60% of Internet content is published in English, while just a quarter of Internet users are native English speakers.
Crucially, the researchers said, “automatic translation of educational videos offers an important avenue for improving online education and diversity in many fields of technology.”
Yet, there are still a number of shortcomings that prevent the system from being fully automated and mean that its application is, for now, limited. The researchers identified several such challenges, including:
- Idiomatic speech: MT cannot reliably deal with idiom, which means that humans are required to edit the MT output (in addition to the ASR output);
- Multiple speakers: the system does not perform well in situations where the person speaking changes frequently, and where voices may overlap (e.g., video interviews);
- Sentence length: text can expand or contract when translated, and it needs to be modified by a human so that the length of the translated audio matches the original.
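The sentence-length problem can be made concrete with a simple duration check. The sketch below is an assumption for illustration, not something described in the paper: the function name `needs_length_edit` and the 10% tolerance are hypothetical, and a real system would likely use a more nuanced criterion.

```python
def needs_length_edit(original_seconds: float,
                      translated_seconds: float,
                      tolerance: float = 0.10) -> bool:
    """Return True if the translated audio is too long or too short.

    The segment is flagged when its duration differs from the original
    by more than `tolerance` (a hypothetical threshold, as a fraction of
    the original duration).
    """
    if original_seconds <= 0:
        raise ValueError("original_seconds must be positive")
    relative_gap = abs(translated_seconds - original_seconds) / original_seconds
    return relative_gap > tolerance


if __name__ == "__main__":
    # A 4.0-second sentence whose translation runs 5.0 seconds is flagged;
    # one that runs 4.2 seconds is within the assumed tolerance.
    print(needs_length_edit(4.0, 5.0))
    print(needs_length_edit(4.0, 4.2))
```

Segments flagged this way would be the ones passed to a human editor to shorten or lengthen the translated text.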
Despite its current limitations, the work “could have a widespread impact across sectors, from education to entertainment and gaming,” the researchers said, pointing out that “the general nature of this technology means it could be applied in many different settings.”
The researchers are mindful of the potential dangers of such video-altering capabilities, which primarily relate to deep fakes and consent issues. They acknowledged that “improving the ability to lip sync means it could be possible to ‘puppet’ an individual’s face using a voice actor’s speech, or other speech not spoken by that person, to generate deep fake content.”
The paper also noted that, “consent was retrieved from the source video owners of the translated videos shown with this work, and all video content generated via our system contains visible watermarks, so viewers are aware of any synthetic content displayed.”
Their demo videos are available to watch on YouTube.