How to Unlock More Value from Audiovisual Assets with Automatic Dubbing

Roughly 50% of the world’s internet users cite watching videos as their primary reason for going online. Given the demand for video content, it should come as no surprise that the demand for dubbing is also on the rise. This poses a major challenge for content producers, as dubbing is the most labor-intensive type of media localization.

Fortunately, there is an innovative solution that keeps getting better: automatic dubbing. And AppTek has just the right suite of products for every need.

As the name suggests, automatic dubbing is the process of automatically revoicing videos to reproduce the original experience in the native language of the target audience. Its purpose is to make video content accessible in additional languages at scale, at a fraction of the time and cost of traditional dubbing.

AppTek is at the cutting edge of this field and is launching an AI-enabled, speaker-adaptive automatic dubbing technology that revolutionizes spoken translation. It allows content producers (is that you?) to create compelling experiences for international audiences, quickly and affordably.

For a sample of the technology, take a look at the demo below and note how changing speakers are automatically recognized and how the timed translation adapts to each speaker's voice.

How AppTek Refined Automatic Dubbing Technology

Creating this state-of-the-art technology was not without challenges. AppTek has invested significant R&D time and funds to see how far the quality achievable through automatic dubbing can be extended. This work includes the development of unique systems across the dubbing pipeline, listed below and illustrated by a minimal code sketch after the list:

  • Adapted automatic speech recognition (ASR) that considers dialects, accents, domains, channels, and demographics for higher-quality source transcriptions.
  • Speech separation to isolate speech from other audio, muting source speech elements while retaining the rest of the audio from the source media files.
  • Isometric machine translation (MT) to control the length of speech output, meet time constraints, and better match the target language to the source input.
  • Metadata-aware MT to further customize output, with controls for dialect, genre, formality, topic, gender, and more to better match translations to their source content.
  • Speaker diarization to detect speaker changes and segment the audio into per-speaker time groupings.
  • Zero-shot speaker-adapted text-to-speech (TTS) to revoice content in the target language using characteristics of the source speaker’s voice, so the translation sounds similar to the original speaker.
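
To make the dataflow concrete, here is a minimal orchestration sketch. AppTek has not published a public API for this pipeline, so every function name and type below is a hypothetical stand-in with placeholder behavior; only the stage order reflects the list above.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the stages listed above; not AppTek's API.

@dataclass
class Segment:
    start: float           # seconds into the source audio
    end: float
    speaker: str           # label assigned by diarization
    source_text: str = ""
    target_text: str = ""

def diarize(audio: bytes) -> list[Segment]:
    """Speaker diarization: split audio into per-speaker time groupings."""
    return [Segment(0.0, 2.4, "spk0"), Segment(2.4, 5.1, "spk1")]

def transcribe(audio: bytes, seg: Segment) -> str:
    """Adapted ASR: transcribe one segment in the source language."""
    return "placeholder transcript"

def translate_isometric(text: str, max_seconds: float) -> str:
    """Isometric MT: a translation whose spoken length fits the time slot."""
    return "placeholder translation"

def synthesize(text: str, seg: Segment, audio: bytes) -> bytes:
    """Zero-shot speaker-adaptive TTS, conditioned on the source voice."""
    return b"\x00" * 16

def dub(audio: bytes, background: bytes) -> list[bytes]:
    """Full pipeline: diarize -> ASR -> isometric MT -> adaptive TTS.
    `background` is the non-speech track kept by speech separation,
    to be mixed back under the synthesized clips afterwards."""
    clips = []
    for seg in diarize(audio):
        seg.source_text = transcribe(audio, seg)
        seg.target_text = translate_isometric(seg.source_text,
                                              seg.end - seg.start)
        clips.append(synthesize(seg.target_text, seg, audio))
    return clips
```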

Mattia Di Gangi, AppTek’s Lead Science Architect for Speech Translation, explains: “Automatic dubbing at AppTek is an ever-evolving process with the goal of optimizing the collaboration between many high-quality machine learning models. Our models change over time, not only for quality but also to work with different information in input or in output. Moreover, our pipeline changes over time according to the information we obtain from the dubbed videos to improve the overall viewing experience.”

Automatic Dubbing Tiers and Use Cases

While the underlying ASR, MT, and TTS technologies embedded in automatic dubbing have been in development and refinement for years, their combination represents a relatively new offering for audiovisual localization.

As more markets begin to evaluate the technology, we have divided the approaches companies can take into tiers based on objective, budget, and the markets best served by each. These tiers serve as a guide to where the technology may best be applied.

Fully Automatic Dubbing

Stack: Adapted Automatic Speech Recognition with Speaker Diarization > Adapted Metadata-Informed MT with Isochrony > Speaker-Adaptive Speech Synthesis

Business case: There is a need for scalable speech translations, but translator or dubbing resources are unavailable and/or budget is a constraint.

A fully automatic pipeline serves as a low-cost, scalable solution to deliver more engaging experiences, as opposed to delivering only automated subtitles. While baseline ASR/MT models perform sufficiently for most general news, media, and other content, higher accuracy can be achieved by fine-tuning the models to more specialized, domain-specific content.
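
To picture the isochrony constraint named in the stack above: among candidate translations, a system can prefer the one whose estimated speaking time best fits the source segment’s slot. The toy heuristic below is an assumption for illustration (a flat words-per-second rate), not AppTek’s actual method.

```python
def speaking_time(text: str, words_per_second: float = 2.5) -> float:
    """Rough duration estimate; real systems model phoneme durations."""
    return len(text.split()) / words_per_second

def pick_isochronous(candidates: list[str], slot_seconds: float) -> str:
    """Choose the candidate translation that best fills the time slot."""
    return min(candidates, key=lambda t: abs(speaking_time(t) - slot_seconds))

# A 1.5-second slot favors the shorter rendering:
print(pick_isochronous(
    ["Thanks a lot for joining us here today", "Thanks for joining"], 1.5))
```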

Best suited for: Content where the objective is to provide the consumer with a strong gist of what is being said while presenting it in a more immersive manner.

Sample markets include: Non-critical news, forms of user-generated content, and any general content whose audience reach would improve by making it accessible to end users who speak other languages.

A visible disclaimer stating “Translations and Speaker Dubbing Machine-Generated” is recommended in these instances. This informs users that automated systems are in use and that there is potential for errors in the content they are watching.

Automatic Dubbing into Multiple Languages from Corrected Transcript

Stack: Existing or Corrected Source Language Transcript > Adapted Metadata-Informed MT with Isochrony > Speaker-Adaptive Speech Synthesis

Business case: A corrected source transcript can be made available, but for budgetary reasons or a lack of translator availability, machine translation is used along with speaker-adaptive speech synthesis to produce the output.

Using a corrected source transcript improves the machine-translated output, raising its overall accuracy to an acceptable level.
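
A corrected transcript often arrives as a caption file. As a sketch, the snippet below turns SRT cues (a standard caption format) into timed segments that a downstream MT stage could consume; the tuple shape is our own choice, not a prescribed interface.

```python
import re

TIME = r"(\d+):(\d+):(\d+),(\d+)"
CUE = re.compile(rf"{TIME} --> {TIME}")

def to_seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_srt(text: str) -> list[tuple[float, float, str]]:
    """Return (start, end, text) for each cue in an SRT caption file."""
    cues = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        m = CUE.match(lines[1])              # line 0 is the cue index
        start = to_seconds(*m.groups()[:4])
        end = to_seconds(*m.groups()[4:])
        cues.append((start, end, " ".join(lines[2:])))
    return cues

sample = "1\n00:00:01,000 --> 00:00:03,500\nWelcome back to the show.\n"
print(parse_srt(sample))   # [(1.0, 3.5, 'Welcome back to the show.')]
```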

Best suited for: Content where this level of translation accuracy suffices and a source transcript is available. It is also well suited to workflows where a human-in-the-loop step can improve the machine translation.

Sample markets include: Programs for which captioned files are customarily produced, such as news shows and media archives, or markets looking to expand content reach on a budget with improved translations, for instance user-generated content such as cooking, travelogue, fitness, or unboxing videos.

As with the fully automatic tier, a visible “Translations and Speaker Dubbing Machine-Generated” disclaimer is recommended when showcasing automatically dubbed videos.

Automatic Dubbing from Corrected Translation

Stack: Existing or Corrected Translation > Speaker-Adaptive Speech Synthesis

Business case: An accurate translation is required or is readily available, but there are budgetary concerns for professional dubbing services, or a need for more efficient workflows.

One example is SaaS-based instructional videos, where the content changes as consistently as the product does. Another is a commercial in which verbal mentions of pricing or alternative messaging are being tested, neither of which could be managed efficiently through multiple reshoots.
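
Because only the script changes between versions in cases like these, a workflow can re-synthesize just the lines that differ rather than redubbing the whole video. A minimal sketch, assuming the two script versions align line by line and with a placeholder in place of any real TTS call:

```python
def redub_changed(old: list[str], new: list[str],
                  synthesize=lambda line: f"<audio for: {line}>") -> dict:
    """Re-synthesize only the lines that changed between script versions.
    Assumes both versions have the same number of aligned lines."""
    return {i: synthesize(line)
            for i, (prev, line) in enumerate(zip(old, new)) if prev != line}

v1 = ["Plans start at $29 a month.", "Cancel anytime."]
v2 = ["Plans start at $24 a month.", "Cancel anytime."]
print(redub_changed(v1, v2))   # only line 0 is regenerated
```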

Best suited for: Content where 100% accuracy is required, such as legally or price-sensitive information, or content that often requires changes and reshoots.

Sample markets include: E-learning, corporate communications, marketing content, and instructional videos.

Speech Synthesis-Adjusted Dubbing

Stack: Existing or Corrected Translation > Pre-trained and/or Manually Adjusted Speaker-Adaptive Speech Synthesis

Business case: Content where using synthetic speech is more cost-effective and/or efficient, and there is a range of emotional inflection in the dialog.

Speech synthesis models adapted on emotional speech in the source and target languages can produce more emotional inflection and prosody in the content. In situations where there are unique pronunciations, or where the speech synthesis does not meet the desired inflection, professional fine-tuning of phonetics, pitch, and tempo can be deployed to further improve the output.
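
Many TTS engines accept exactly these kinds of adjustments through SSML, the W3C markup standard for speech synthesis. The sketch below annotates a line with pitch, tempo, and an explicit IPA pronunciation; the specific values and the pronunciation of “AppTek” are illustrative, not prescribed.

```python
def adjust(ssml_text: str, pitch: str = "+0%", rate: str = "100%") -> str:
    """Wrap a line in SSML prosody controls for pitch and tempo."""
    return (f'<speak><prosody pitch="{pitch}" rate="{rate}">'
            f"{ssml_text}</prosody></speak>")

# Force a pronunciation with an IPA phoneme tag, then slow the line down:
line = 'Try <phoneme alphabet="ipa" ph="ˈæp.tɛk">AppTek</phoneme> today'
print(adjust(line, pitch="+5%", rate="90%"))
```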

Best suited for: Content that features a range of emotion, unique pronunciations, or distinctive speech inflections.

Sample markets include: Forms of advertising, more emotional user-generated content, documentaries, telenovelas, and lower-budget films.

Market Opportunities and The Future of Automatic Dubbing  

To sum up, automatic dubbing enables brands to share a wider range of multilingual audiovisual content at lower cost. There are opportunities across many sectors, from fully automated output serving use cases such as user-generated content, to gradually improving output quality via a human-in-the-loop step in any or all stages of the pipeline.

Opportunities also lie in the mid-to-high-tier markets, from news and media archives, e-learning, and corporate video to gaming and media localization for less-than-premium content. Brands that produce this type of content can use automatic dubbing to speed up production and increase ROI, especially as they won’t need to rely on voice actors alone.

Affordable automatic dubbing is available from AppTek for language services providers. This enables LSPs to generate incremental revenue and cast a wider net for localization opportunities as multimedia content grows exponentially.

Now is the perfect time to become an early adopter of automatic dubbing technology. 

Want to learn more about AppTek? Schedule a meeting today.