AI dubbing is good enough for certain contexts and has clear uses, but it is not perfect (yet). Currently, quality varies with the degree of automation and the amount of human involvement.
AI dubbing — also known as machine dubbing or automatic dubbing — is the process of replacing a human voice with a synthesized voice using text-to-speech, or “cloning” the tonal qualities of a human voice and applying those same qualities to another human’s voice (also called speech-to-speech or re-voicing). The aim is to make video content more accessible in additional languages and replicate the original multimedia experience in a different language.
Demand for dubbing is on the rise; roughly half of the world’s internet users cite watching videos as the main reason for being online. Netflix has revealed that dubbed shows gained more traction than the subtitled versions and that they have been increasing their dubbing investment by 25-35% annually.
How Does AI Dubbing Work?
The AI dubbing workflow incorporates automatic speech recognition (ASR, or speech-to-text, STT) of the source language, auto-alignment of the text on a timeline, neural machine translation (NMT) into the desired target language, and speech synthesis to produce the voice used in the target language for the localized version.
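The chain of stages described above can be sketched in a few lines of Python. This is only an illustrative outline, not any vendor's implementation: the `Segment` type, the `dub` function, and the three stand-in stages (`fake_asr`, `fake_nmt`, `fake_tts`) are hypothetical names invented for this example, and a real system would plug actual ASR, NMT, and speech synthesis engines into the same slots.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    start: float  # seconds on the source timeline
    end: float
    text: str     # transcribed source-language text


def dub(
    audio: bytes,
    recognize: Callable[[bytes], List[Segment]],
    translate: Callable[[str], str],
    synthesize: Callable[[str, float], bytes],
) -> List[Tuple[float, float, bytes]]:
    """Run ASR -> translation -> synthesis, keeping the source timings."""
    dubbed = []
    for seg in recognize(audio):            # ASR plus alignment: timed text
        target_text = translate(seg.text)   # NMT into the target language
        slot = seg.end - seg.start          # time available for the new audio
        dubbed.append((seg.start, seg.end, synthesize(target_text, slot)))
    return dubbed


# Hypothetical stand-in stages so the sketch runs without any speech service.
def fake_asr(audio: bytes) -> List[Segment]:
    return [Segment(0.0, 1.5, "hello"), Segment(1.5, 3.0, "world")]


def fake_nmt(text: str) -> str:
    return {"hello": "hola", "world": "mundo"}[text]


def fake_tts(text: str, slot: float) -> bytes:
    return f"<{text}:{slot:.1f}s>".encode()


result = dub(b"", fake_asr, fake_nmt, fake_tts)
```

Passing the stages in as functions keeps the timeline logic separate from whichever engines are used, which is also where a human reviewer could be inserted between steps.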
The human-in-the-loop workflow for voiceover using text-to-speech requires a human to manipulate the text and to “prompt” the voice engine to generate the most natural-sounding synthetic voice. Synthesized voices essentially replace the need for voice talent or recording studios.
Equally, for lip-sync using speech-to-speech or re-voicing, the AI model must be trained on human voice samples. A voice actor performs in a recording studio to achieve lip-sync and the necessary emotions.
AI Dubbing in Use
Automated techniques are ideal for dubbing off-screen speakers or narrators (automatic voiceover) as lip-sync becomes a non-issue. The technology can be implemented across many sectors, such as user-generated content, news, e-learning, corporate videos, and gaming and media localization.
Currently, AI dubbing is less suited for dramatic content; film and TV clients appear to be less willing to move away from traditional dubbing. Yet, the CEO of NeuralGarage, Mandar Natekar, sees OTT platforms (e.g., Netflix, YouTube), studios, broadcast networks, and ad agencies as potential clients.
AI dubbing can help to mitigate the challenge of rising demands for dubbing, reduce labor needs, alleviate issues around using child voice actors, lower costs, and ensure consistency of a character’s voice across all languages thanks to speech-to-speech or re-voicing.
Good, but Not Good Enough
AppTek’s Volker Steinbiss described the extent to which AI dubbing can be implemented, along with its limitations. He clarified, “At this point in time, automatic dubbing is mostly for making available large amounts of content that otherwise wouldn’t be dubbed in many languages”. So while AI dubbing may not be the first choice for media localization, it can be used to scale the availability of content.
Steinbiss explained that the more automation involved, the poorer the quality, so fully automated dubbing processes should only be used for noncritical content or content with a short life cycle.
Deepdub CEO, Ofir Krakowski, agrees that humans are still needed to create high-quality output. Deepdub’s workflow incorporates humans at various stages so its “platform constantly learns and improves through this iterative process.”
So, what constitutes a high-quality dub? Manel Carreras of EVA Group outlined that if a dub is of high quality, it will not be noticed by the viewer; “A good dub shouldn’t be a distraction.”
Attaining this level of quality requires successful translation, script adaptation, voice acting, and audio mixing. Quality is affected by the timing and visual synchronization between the on-screen actors’ mouths and the voice actors’ words. However, Amazon’s study of human vs machine dubbing concluded that humans prioritize the quality of the dubs over timing; similarly, Papercup found that, based on market demands, “lip-syncing seems to be quite low on the list”.
The sound of AI-dubbed voices, in terms of naturalness, human-likeness, and level of emotion, also contributes to overall quality. Emotion is possibly “the most challenging aspect” of AI dubbing, according to Dubverse’s Anuja Dhawan. Expressing emotion is not just about the words, but also how they are said. Voiseed CEO, Andrea Ballista, explained, “the way [emotions] are conveyed, in terms of culture and vocal apparatus, are a little bit different” so “the emotional delivery of the line has to be… remapped into another language.”
The Crystal Ball for the Future of AI Dubbing
The technology is constantly improving, and advancements are already emerging that help refine the process and enhance the quality of AI dubbing:
- Amazon’s new “duration aware” model is a method for training automated dubbing systems which adjusts translations to align with speech duration, and vice versa.
- Deepfake dubbing and AI deepfake technology can be used to adjust video and audio content to help solve lip-sync issues. Likewise, NeuralGarage uses a machine learning framework, Generative Adversarial Networks, to “transform the lip and jaw movements to match the speech irrespective of language”.
- Post-editing synthetic voices makes speech more emotionally engaging.
- Speech-to-speech can be used to enhance TTS capabilities and provide emotion.
- Combining TTS with text-to-video (TTV) can offer multi-language, synthesized voices.
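To make the duration-matching idea in the first item above more concrete, here is a minimal Python sketch. It is not Amazon's model, which is trained end to end; it is a simple heuristic stand-in that picks, from several candidate translations, the one whose estimated spoken length best fits the time slot of the original line. The function names and the fixed rate of 15 characters per second are assumptions made purely for illustration.

```python
from typing import Sequence


def estimated_duration(text: str, chars_per_second: float = 15.0) -> float:
    """Crude spoken-length estimate; the speaking rate is an assumed constant."""
    return len(text) / chars_per_second


def pick_translation(candidates: Sequence[str], slot_seconds: float) -> str:
    """Return the candidate whose estimated duration is closest to the slot."""
    return min(
        candidates,
        key=lambda c: abs(estimated_duration(c) - slot_seconds),
    )


def speed_factor(text: str, slot_seconds: float) -> float:
    """Playback-rate multiplier needed to fit the text into the slot.

    Values above 1 mean the synthesized speech must be sped up.
    """
    return estimated_duration(text) / slot_seconds


options = ["Gracias", "Muchas gracias", "Muchas gracias por todo"]
best = pick_translation(options, slot_seconds=1.0)
rate = speed_factor(best, slot_seconds=1.0)
```

The "vice versa" direction mentioned in the list corresponds to `speed_factor`: when no candidate fits well, the speech rate is adjusted instead of the wording.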
The future seems bright for AI dubbing, with revenues in the global automated dubbing market expected to rise from USD 117m to almost USD 190m by 2030. The Slator 2022 Language Industry M&A and Funding Report demonstrated the growing popularity of speech technology applications among investors.
According to the Slator 2023 Market Report, AI dubbing will rapidly advance “as voice synthesis companies proliferate”. Predicted trends include (1) increased adoption of AI dubbing as the amount of voice data grows, (2) AI voices becoming almost “indiscernible from human voices” for most content types, and (3) more intricate tools for prompting voice models.