AppTek on How Content Creators Leverage Automatic and Adaptive Speech Dubbing

During the Lead Partner Presentation at SlatorCon Remote March 2023, AppTek’s Managing Director, Volker Steinbiss, gave attendees an informative and entertaining demo of the company’s speaker-adaptive, fully automatic AI speech dubbing technology.

Steinbiss began his presentation by showing the technology’s capacity for “revoicing without lip sync,” a process that involves audio processing only. He demonstrated it with a video clip of two scientists talking about the James Webb telescope. The original video is in US English and was machine-dubbed into German, Italian, French, Polish, Dutch, and Brazilian Portuguese.

Steinbiss asked attendees to pay attention to similarities between the original voices and the synthetic voices, and stressed that this was a fully automatic sample, created without manual intervention, refinement or editing, “so you get a realistic experience.” The synthetic voice was meant to mimic the original voice’s emotion and tone.

Steinbiss also played a scene from the famous English-language movie “Casablanca” dubbed automatically into German. He explained that the source material is generally an audio mix of the actor speaking, complete with background music and noises.

These are some representative steps followed to arrive at a dubbed clip:

Source separation

The system splits the original audio into speech and nonspeech components. The speech components are captured and the voices are isolated. The background audio is saved for later.

Automatic speech recognition

Automatic speech recognition captures the text from the spoken audio. An extra step adds punctuation, which the machine translation step requires.

Speaker diarization

To know who is speaking and at what moment, the speaker diarization function creates segments for each speaker, and identifies different speakers for text-to-speech or synthetic speech. It also helps improve the translation of the automatic transcription. 
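As a rough illustration of how diarization output can be combined with a transcript, the sketch below attaches each recognized word to the speaker turn it overlaps most and merges consecutive words from the same speaker into one segment. The `Word` and `SpeakerTurn` types and the `group_words_by_speaker` function are hypothetical names chosen for this example; the presentation did not describe AppTek's actual data structures or algorithm.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the clip
    end: float


@dataclass
class SpeakerTurn:
    speaker: str
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def group_words_by_speaker(words, turns):
    """Assign each word to its best-overlapping turn, then merge
    consecutive same-speaker words into segments.

    Assumes `turns` is non-empty and `words` is sorted by start time.
    """
    segments = []
    for word in words:
        best = max(
            turns,
            key=lambda t: overlap(word.start, word.end, t.start, t.end),
        )
        if segments and segments[-1]["speaker"] == best.speaker:
            segments[-1]["text"] += " " + word.text
            segments[-1]["end"] = word.end
        else:
            segments.append(
                {
                    "speaker": best.speaker,
                    "text": word.text,
                    "start": word.start,
                    "end": word.end,
                }
            )
    return segments


words = [Word("Hello", 0.0, 0.4), Word("there", 0.5, 0.9), Word("Hi", 1.2, 1.5)]
turns = [SpeakerTurn("A", 0.0, 1.0), SpeakerTurn("B", 1.0, 2.0)]
segments = group_words_by_speaker(words, turns)
```

Each resulting segment carries a speaker label and a time span, which is the information the later translation and voice-assignment steps rely on.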

Meta-aware machine translation

Metadata, including characteristics such as gender of speakers, is important for the target languages, as is the use of formal or informal tone. Elements like time constraints are also key to the final output: if the translation is too long, speeding up the audio would create unnatural speech. Therefore, one of several possible correct translations is selected to match the length of speech. This is how the system creates the synthetic speech audio in the target language.
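The length-matching idea can be sketched as follows: given several acceptable translations, pick one whose estimated spoken duration fits the time available. This is a minimal sketch under stated assumptions, not AppTek's method. It assumes the candidates arrive ordered best-first, and it estimates duration from character count using a made-up speaking rate (`chars_per_second=15.0`); a real system would likely use a far better duration model.

```python
def estimate_duration(text, chars_per_second=15.0):
    """Crude duration estimate in seconds; the rate is an assumed placeholder."""
    return len(text) / chars_per_second


def pick_translation(candidates, slot_seconds, chars_per_second=15.0):
    """Return the first candidate (assumed ordered best-first) whose
    estimated duration fits the slot. If none fits, fall back to the
    shortest candidate so that the least speed-up is needed later.
    """
    if not candidates:
        raise ValueError("at least one candidate translation is required")
    for candidate in candidates:
        if estimate_duration(candidate, chars_per_second) <= slot_seconds:
            return candidate
    return min(candidates, key=len)


# Two hypothetical candidates of 60 and 28 characters.
long_candidate = "a" * 60   # about 4.0 s at the assumed rate
short_candidate = "b" * 28  # about 1.9 s at the assumed rate
chosen = pick_translation([long_candidate, short_candidate], slot_seconds=2.0)
```

With a two-second slot, the longer, higher-ranked candidate is skipped in favor of the shorter one that fits.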

Speaker embedding assignment

The voice characteristics of each speaker are applied to identified speaker segments. Segment timing is considered to assign speech placements and manage pauses. The speech output is adjusted as needed to fit in the allotted time segment.
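To make the timing adjustment concrete, here is a small sketch of fitting a synthesized utterance into its slot. It assumes a cap on how much the speech may be sped up (`max_speedup=1.15` is an invented value, not a figure from the presentation), since excessive speed-up sounds unnatural. The `fit_to_slot` function and `Fit` type are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Fit:
    tempo: float           # playback speed factor, 1.0 means unchanged
    trailing_pause: float  # silence in seconds left after the speech
    overflow: float        # seconds the speech still exceeds the slot


def fit_to_slot(speech_seconds, slot_seconds, max_speedup=1.15):
    """Decide how to place synthesized speech in its time slot.

    Short speech is left at normal tempo and padded with a pause.
    Long speech is sped up, but never beyond `max_speedup` (an assumed cap).
    """
    if speech_seconds <= 0 or slot_seconds <= 0:
        raise ValueError("durations must be positive")
    if speech_seconds <= slot_seconds:
        return Fit(1.0, slot_seconds - speech_seconds, 0.0)
    tempo = min(speech_seconds / slot_seconds, max_speedup)
    overflow = max(0.0, speech_seconds / tempo - slot_seconds)
    return Fit(tempo, 0.0, overflow)


short = fit_to_slot(2.0, 3.0)      # fits: keep tempo, add a pause
too_long = fit_to_slot(6.0, 3.0)   # capped speed-up, still overflows
```

A nonzero `overflow` signals that speed-up alone is not enough, which is the case where choosing a shorter translation becomes the better fix.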

Use Cases for Different Needs

Unlike in the Webb Telescope video clip, in the “Casablanca” example the machine translation and the synthetic voices were post-edited. Steinbiss explained that post-editing is needed in most situations. AppTek offers both options, i.e., fully automatic or post-edited dubbing.

However, the possibilities don’t end with those two options. In his presentation, Steinbiss also showed four tiers: combinations of processes for different markets and use cases that result in different levels of quality.

Tier one is fully automatic technology. Tier two assumes the availability of a correct script in the source language. Tier three involves a correct script in the target language as well (that is, the translation has been edited). Tier four combines tiers two and three, and adds editing of the synthetic voice. The more editing done, the higher the emotional range, resulting in more natural dubbed speech.

Steinbiss explained that tier one is for noncritical content, when the intent is to provide an idea of what is being said in other languages. “It’s for content with a short lifecycle, but offers accessibility to foreign language speakers.”

Tier two is used when caption files are available. It aids in obtaining better translations thanks to an accurate transcript. It is suitable when the translation of content into many languages at once is needed. Tier three involves edited, accurate translation, such as the content needed for eLearning or corporate communications. Tier four could be used for documentaries, low-budget films or telenovelas.
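The four tiers can be summarized as a small lookup of which stages receive human input. The sketch below is only one reading of the description above: the `Tier` type, its field names, and `human_steps` are hypothetical, and tier three is interpreted as including a correct source script because the target script is described as available "as well."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    source_script_verified: bool  # correct source-language script available
    translation_edited: bool      # target-language script has been edited
    voice_edited: bool            # synthetic voice has been edited


# One interpretation of the four tiers described in the presentation.
TIERS = {
    1: Tier(False, False, False),  # fully automatic
    2: Tier(True, False, False),
    3: Tier(True, True, False),
    4: Tier(True, True, True),
}


def human_steps(tier_number):
    """List the stages that involve human input for a given tier."""
    tier = TIERS[tier_number]
    steps = []
    if tier.source_script_verified:
        steps.append("source script")
    if tier.translation_edited:
        steps.append("translation")
    if tier.voice_edited:
        steps.append("synthetic voice")
    return steps


tier_one_steps = human_steps(1)
tier_four_steps = human_steps(4)
```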

[Figure: AppTek pricing tiers]

Post Editor for Synthetic Voices

The future is about scaling. As Steinbiss put it, “at this point in time, automatic dubbing is mostly for making available large amounts of content that otherwise wouldn’t be dubbed in many languages.” 

Steinbiss also envisions that automatic technology like that offered by AppTek will bring about new roles. For example, he offered “synthetic voice director” and “post editor for synthetic voices.” 

What’s next in the development pipeline for AppTek? Dubbing for live events, including news. There will also be improvements in emotion and prosody for synthetic voices, and in lip-synching.

In closing, Steinbiss asked the audience to think about what’s in it for them with this technology, including business expansion. He added that the development of planned improvements will be fast, but that there are already a lot of opportunities to use the current technology to expand multilingual access to massive amounts of audio content.