Sony’s DubWise Uses Visual Cues from the Video to Improve AI Dubbing

In a June 13, 2024 paper, researchers from Sony Research India and the Indraprastha Institute of Information Technology (IIIT) introduced DubWise, a method designed to synchronize dubbed audio with the visual content of a video. 

DubWise employs a multimodal approach that combines large language models (LLMs) for text-to-speech (TTS) with visual cues from the video. As the researchers noted, “videos provide more reliable guidance than audio for alignment.”

Combining the two modalities allows the system not only to translate the dialogue but also to control the duration of the translated speech so that it matches the lip movements and timing of the original video.

The system first uses an LLM to generate the translated text and then employs a duration prediction model that takes into account both the text and visual cues from the video, such as the speaker’s lip movements and facial expressions.
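As a rough illustration of how a visual cue can become a duration target, the sketch below converts per-frame "mouth active" flags into a target speech duration in seconds. The flags, the frame rate, and the function name are assumptions made for this example. The duration model described in the paper is learned and is not reproduced here.

```python
from typing import Sequence


def target_duration_from_frames(mouth_active: Sequence[bool], fps: float) -> float:
    """Return the seconds spanned from the first to the last active frame (inclusive).

    Hypothetical helper: `mouth_active` holds one flag per video frame,
    True when the speaker's mouth is judged to be moving.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    active = [i for i, flag in enumerate(mouth_active) if flag]
    if not active:
        return 0.0  # no visible speech in this clip
    return (active[-1] - active[0] + 1) / fps


# Made-up clip: 2 silent frames, 50 speaking frames, 3 silent frames at 25 fps.
flags = [False, False] + [True] * 50 + [False] * 3
print(target_duration_from_frames(flags, fps=25.0))  # 2.0
```

A target like this could then be handed to a TTS system as the length the translated line should fill.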

The researchers chose GPT-2 for multilingual TTS due to its smaller model size and wider adaptability in state-of-the-art TTS systems.

“Our method utilizes visual cues extracted from the video to achieve duration controllability in GPT-based TTS while maintaining intelligibility and speech quality,” they said.

According to the researchers, DubWise can address the challenging problem of audio-visual alignment after dubbing. They explained that traditional AI dubbing technologies often fail to align dubbed audio with the video, so the speech appears out of sync with the speaker on screen. This misalignment occurs because TTS-generated speech in the target language often has a different length than the original audio, they added.
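The length mismatch is simple arithmetic. The sketch below, with made-up durations and a hypothetical function name, computes how much faster or slower a synthesized line would have to be played to fit the original segment. It illustrates the problem only and is not the method used in DubWise.

```python
def rate_factor(tts_seconds: float, target_seconds: float) -> float:
    """Return the playback-speed multiplier needed to fit TTS audio into the target slot.

    A value above 1.0 means the speech must be sped up; below 1.0, slowed down.
    """
    if tts_seconds <= 0 or target_seconds <= 0:
        raise ValueError("durations must be positive")
    return tts_seconds / target_seconds


# Hypothetical example: the original line lasts 2.0 s, the translated TTS lasts 2.5 s.
factor = rate_factor(2.5, 2.0)
print(f"{factor:.2f}")  # 1.25
```

Uniformly speeding audio up by a factor like this after synthesis is one common workaround; controlling duration during generation, as the researchers describe, avoids that post-processing step.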

First-of-its-Kind Attempt

“This is the first attempt of its kind that utilizes video-based modality for achieving duration controllability in […] LLM-based multimodal TTS,” the researchers stated.

They conducted experiments in both single-speaker and multi-speaker scenarios and used various metrics to evaluate duration control, intelligibility, and lip-sync accuracy.
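The article does not name the specific metrics. As one generic way to quantify duration control, the sketch below computes the mean absolute difference between generated and reference segment durations. The function name and the sample values are assumptions for illustration and are not taken from the paper.

```python
from typing import Sequence


def mean_abs_duration_error(generated: Sequence[float], reference: Sequence[float]) -> float:
    """Return the average absolute gap, in seconds, between paired durations."""
    if len(generated) != len(reference):
        raise ValueError("sequences must have the same length")
    if not generated:
        raise ValueError("at least one segment is required")
    return sum(abs(g - r) for g, r in zip(generated, reference)) / len(generated)


# Made-up durations (seconds) for three dubbed segments and their originals.
gen = [2.5, 1.0, 3.0]
ref = [2.0, 1.0, 3.5]
print(f"{mean_abs_duration_error(gen, ref):.3f}")  # 0.333
```

A lower value means the dubbed segments track the original timing more closely.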

The researchers say that DubWise outperforms other state-of-the-art methods across various metrics. It achieved improved lip synchronization and naturalness in both same-language and cross-lingual scenarios while maintaining speech intelligibility and quality.

Demo samples are available at https://nirmesh-sony.github.io/DubWise/ 

Authors: Neha Sahipjohn, Ashishkumar Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Rajiv Ratn Shah