Localization, of course, has been key to Netflix’s success worldwide. But Netflix is also notable for introducing many monolingual English-speakers, via subtitles or dubbing, to non-English TV series and movies they might otherwise never encounter — including the recent Korean phenomenon Squid Game and, earlier, Money Heist (for which Netflix had to replace awkward English dubs back in 2019).
Now TikTok’s parent company, ByteDance, is in the game as well. An October 2021 paper, “Neural Dubber: Dubbing for Silent Videos According to Scripts,” explores whether synthesized human speech could come close to the “impressive ability of professional actors.”
In addition to ByteDance’s Qiao Tian, Yuping Wang, and Yuxuan Wang, contributors to the study included Chenxu Hu (Tsinghua University), Tingle Li (Shanghai Qi Zhi Institute), and Hang Zhao (affiliated with both Tsinghua University and Shanghai Qi Zhi Institute).
“Voice actors are remarkably capable of dubbing according to lines with proper prosody such as stress, intonation and rhythm, which allows their speech to be synchronized with the pre-recorded video,” the authors wrote. When it comes to automatic video dubbing (AVD), the synthesized speech needs to be consistent with both script and lip movement.
Text-to-speech (TTS) synthesis shares AVD’s goal of producing intelligible speech, but it cannot solve the problem on its own: because it takes only text as input, the synthesized speech is unlikely to sync up with the video.
Neural Dubber, on the other hand, uses an image-based speaker embedding (ISE) module that allows it to produce speech consistent with the speaker’s facial features (e.g., gender and age).
Arguably the most challenging part of AVD, the authors said, is aligning the video frames with the phonemes of the script. Neural Dubber’s text-video aligner lets the synthesized speech match the lip movement in the video at the appropriate speed and with the appropriate emotional tone.
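The aligner rests on cross-modal attention: each video frame attends over the script’s phonemes, and the resulting attention matrix acts as a soft alignment between the two sequences. A toy sketch of that idea (shapes, names, and the plain dot-product formulation are illustrative assumptions, not the authors’ code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_video_attention(phoneme_emb, frame_emb):
    """Cross-modal scaled dot-product attention (illustrative only).

    phoneme_emb: (T_text, d)  embeddings of the script's phonemes
    frame_emb:   (T_video, d) visual features of the mouth region per frame

    Returns per-frame context vectors plus the attention matrix,
    which serves as a soft frame-to-phoneme alignment.
    """
    d = phoneme_emb.shape[-1]
    scores = frame_emb @ phoneme_emb.T / np.sqrt(d)  # (T_video, T_text)
    attn = softmax(scores, axis=-1)                  # each frame attends to phonemes
    context = attn @ phoneme_emb                     # (T_video, d)
    return context, attn

# Toy example: 6 video frames attending over 4 phonemes.
rng = np.random.default_rng(0)
ctx, attn = text_video_attention(rng.normal(size=(4, 16)),
                                 rng.normal(size=(6, 16)))
print(ctx.shape, attn.shape)  # (6, 16) (6, 4)
```

Because the context sequence has one vector per video frame, speech generated from it is paced by the video rather than by the text alone.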
The researchers tested Neural Dubber’s performance using Amazon Mechanical Turk. Thirty video clips were selected at random from a single-speaker dataset (nine hours of chemistry lecture videos on YouTube) and a multi-speaker dataset (thousands of sentences spoken on BBC channels).
At least 20 native English-speaking raters scored each video clip based on audio quality and audio-visual synchronization. The team also measured the synchronization between audio and video quantitatively using two metrics: lip-sync error distance (LSE-D) and lip-sync error confidence (LSE-C).
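Both metrics come from a pretrained lip-sync model that embeds short audio and video windows into a shared space. As a schematic sketch (the inputs here are hypothetical precomputed embeddings; the real metrics are produced by a SyncNet-style network, not by this code): LSE-D takes, for each video window, the smallest distance to nearby audio windows, and LSE-C measures how sharply that minimum stands out.

```python
import numpy as np

def lip_sync_metrics(audio_emb, video_emb, max_offset=5):
    """Schematic LSE-D / LSE-C over paired (T, d) embedding sequences.

    For each video window, compare against audio windows within
    +/- max_offset steps: LSE-D averages the minimum distance,
    LSE-C averages (median distance - minimum distance), i.e. the
    confidence that one offset is clearly the best match.
    """
    lse_d, lse_c = [], []
    for t in range(len(video_emb)):
        cands = np.array([np.linalg.norm(video_emb[t] - audio_emb[t + o])
                          for o in range(-max_offset, max_offset + 1)
                          if 0 <= t + o < len(audio_emb)])
        lse_d.append(cands.min())
        lse_c.append(np.median(cands) - cands.min())
    return float(np.mean(lse_d)), float(np.mean(lse_c))
```

Lower LSE-D and higher LSE-C both indicate tighter audio-visual synchronization.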
Neural Dubber’s audio quality for single-speaker AVD is “on par” with that of the TTS model FastSpeech 2, and it actually outperforms FastSpeech 2 for multi-speaker AVD, “exhibiting the effectiveness of ISE” in this more challenging task.
“Voice actors are remarkably capable of dubbing according to lines with proper prosody such as stress, intonation and rhythm, which allows their speech to be synchronized with the pre-recorded video” — ByteDance-led study
To go one step further, the authors demonstrated more explicitly that ISE enables Neural Dubber to control timbre based on input face images. The researchers selected 10 images each for 12 male and 12 female speakers, with slightly different details across the images (e.g., head posture, lighting, and make-up).
The team observed a “distinctive discrepancy” between the speech Neural Dubber produced from the face images of different genders, and concluded that “Neural Dubber can use the face image to alter the timbre of the generated speech.”
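One simple way to quantify such a discrepancy (this is an illustrative check of my own, not the paper’s evaluation protocol) is to embed the generated utterances and compare similarity within each gender group against similarity across groups:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def group_separation(emb_a, emb_b):
    """Mean within-group vs. across-group cosine similarity for two
    sets of utterance embeddings (e.g., speech generated from male
    vs. female face images). A clear gap between the two averages
    suggests the face image is steering the timbre.
    """
    within = [cosine(x, y) for g in (emb_a, emb_b)
              for i, x in enumerate(g) for y in g[i + 1:]]
    across = [cosine(x, y) for x in emb_a for y in emb_b]
    return float(np.mean(within)), float(np.mean(across))
```

If speech generated from male and female faces forms two distinct clusters, the within-group average will clearly exceed the across-group average, matching the “distinctive discrepancy” the authors describe.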