The news about streaming subscribers dropping off the leading over-the-top (OTT) platform could be a canary in the coal mine (or gold mine, if you will) for other streaming giants. But for a media localization market inundated with projects, demand remains strong in the face of a serious talent crunch: lots of content, not enough translators, subbers, or dubbers.
The streaming explosion in the OTT market, led by Netflix, has driven up demand, and thus content spend, over the last couple of years. However, the shortage of translators and related talent has hit hardest in territories where major streaming launches take place simultaneously, a major pain point for content localizers.
According to Kyle Maddock, Marketing SVP at AppTek, “Demand isn’t the problem. Talent is — and so is integrating the right tools to augment that talent. Is there anything we can automate? Can R&D be stepped up so we can use the tech sooner? These are some of the things market players are thinking about right now.”
Hence, content localizers are currently looking at the newest generation of game-changing technologies used in the massive non-entertainment, non-OTT, audio-visual market.
Here, early adopters are already deploying new tech to localize news clips, user-generated content, corporate videos, educational materials, fitness videos, documentaries, low-budget films, and even direct-response videos for online consumers and businesses, immediately expanding a product's global reach.
As Maddock pointed out, “These emerging language technologies, in their current state, may be more applicable to general purpose, content-production markets — even as they continue to evolve to support the more complex needs of the high-end market.”
According to the AppTek SVP, “Emerging localization tech is already usable and can certainly take the pressure off localization workflows that are packed with projects from OTT and other premium media services.”
These new technologies, Maddock said, are also applicable to the high-end market when there is a human in the loop.
6-Step Automatic Dubbing Pipeline
At AppTek, R&D has been in full swing for a while in areas that would otherwise have been treated as frontier tech (with a few years yet ahead for market traction) had it not been for spiraling demand. One such area is automatic dubbing.
The company’s R&D team has been tackling the complex, cross-disciplinary research problem of speech-to-speech (S2S) translation by building a pipeline — with added features that aim to produce media output that can match the speech characteristics of original speech input. To be clear, automatic dubbing is not synonymous with the buzzwords “AI dubbing” and “voice cloning.”
While there are lots of examples of AI dubbing (i.e., deepfakes, where lip movements are changed) and voice conversion (where one person’s voice is masked with another), automating a full dubbing pipeline is a more sophisticated and complex affair.
The pipeline comprises six steps.
- Audio extraction and preprocessing
- Speech recognition and segmentation
- Speaker grouping and feature extraction
- Machine translation
- Speech synthesis
- Creating the final video
AppTek’s Lead Scientist for Speech Translation, Mattia Di Gangi, explained, “The input of the pipeline is a regular video file containing a single video and a single audio stream. The output of the pipeline is a video file containing the same video stream with the addition of a new target language audio stream.”
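The end-to-end flow can be pictured as a thin orchestration layer over the six steps above. Below is a minimal sketch in Python; every stage function is a hypothetical placeholder standing in for the components described in this article, not an AppTek API.

```python
# Sketch of the six-step automatic dubbing pipeline as a thin orchestration layer.
# The stage functions are hypothetical placeholders (not AppTek APIs); each would
# wrap the corresponding component described in this article.

from typing import Any

def extract_and_preprocess(video_path: str) -> tuple[Any, Any]: ...            # 1. voice + residual audio
def recognize_and_segment(voice: Any) -> list[dict]: ...                        # 2. ASR with word timestamps
def group_speakers(voice: Any, segs: list[dict]) -> tuple[list[dict], dict]: ...  # 3. diarization + voice embeddings
def translate_segments(segs: list[dict], lang: str) -> list[dict]: ...          # 4. dubbing-aware MT
def synthesize(segs: list[dict], embeddings: dict, lang: str) -> Any: ...       # 5. zero-shot TTS
def mix_and_mux(video_path: str, synth: Any, residual: Any, out: str) -> None: ...  # 6. final video

def dub_video(video_path: str, target_lang: str, output_path: str) -> None:
    voice, residual = extract_and_preprocess(video_path)
    segments = recognize_and_segment(voice)
    segments, speaker_embeddings = group_speakers(voice, segments)
    translated = translate_segments(segments, target_lang)
    synthetic = synthesize(translated, speaker_embeddings, target_lang)
    mix_and_mux(video_path, synthetic, residual, output_path)
```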
How the Tech Works
Automatic dubbing begins with a phase that consists of audio preprocessing, transcription, segmentation, and feature-vector extraction.
The original audio is first extracted from the video source. Next, residual sound (e.g., music, background noise) is separated out so that clean voice features can be extracted; the residual is added back when the final audio stream is generated. This is particularly important for the speaker voice adaptation step that takes place later in the automatic dubbing process.
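As a rough illustration of this first step, the sketch below extracts the audio track with ffmpeg and then splits it into a vocal stem and a residual stem. The open-source Demucs separator is used purely as a stand-in; the specific tools are assumptions, not AppTek's internal preprocessing components.

```python
# Step 1 sketch: extract the audio track and split it into voice and residual.
# ffmpeg handles extraction; Demucs (open source) stands in for source separation.
# Neither is necessarily what AppTek uses internally.

import subprocess

def extract_audio(video_path: str, wav_path: str) -> None:
    """Pull the audio stream out of the video as 44.1 kHz stereo WAV."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "2", "-ar", "44100", wav_path],
        check=True,
    )

def separate_voice(wav_path: str, out_dir: str) -> None:
    """Split the track into a 'vocals' stem and a 'no_vocals' (residual) stem.

    Demucs writes the stems under out_dir; the exact layout depends on the
    Demucs version and model used.
    """
    subprocess.run(["demucs", "--two-stems", "vocals", "-o", out_dir, wav_path], check=True)

extract_audio("episode.mp4", "episode.wav")
separate_voice("episode.wav", "stems/")
# The residual stem is kept aside and mixed back in after speech synthesis.
```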
Once the audio has been preprocessed, domain-adapted speech recognition systems generate transcripts with precise timestamps for the start and end of each word.
“At AppTek, we utilize our media and entertainment ASR system, which has been trained on large amounts of broadcast data,” Di Gangi said. To inform the ASR output for unique terminology (e.g., Quidditch, from Harry Potter), custom lexicons and dictionaries can also be used, if available.
Di Gangi added, “ASR includes a punctuation system — which outputs the text in appropriately punctuated segments — while speaker diarization assigns speaker labels to each segment. The output is combined to form well-structured and speaker-segmented input for the subsequent machine translation step.”
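The combined output of ASR, punctuation-based segmentation, and diarization can be pictured as a list of speaker-labeled segments with word-level timings. The structure below is illustrative only; the field names are invented for clarity and are not AppTek's schema.

```python
# Illustrative shape of speaker-segmented ASR output handed on to machine translation.

from dataclasses import dataclass, field

@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the audio
    end: float

@dataclass
class Segment:
    speaker: str                 # diarization label, e.g. "SPK_1"
    text: str                    # punctuated segment text from ASR
    words: list[Word] = field(default_factory=list)

    @property
    def duration(self) -> float:
        """Segment duration, later used as the isochrony budget for the translation."""
        return self.words[-1].end - self.words[0].start if self.words else 0.0

segment = Segment(
    speaker="SPK_1",
    text="Welcome back to the show.",
    words=[Word("Welcome", 3.10, 3.52), Word("back", 3.52, 3.74),
           Word("to", 3.74, 3.82), Word("the", 3.82, 3.90), Word("show.", 3.90, 4.35)],
)
print(round(segment.duration, 2))  # 1.25 s available for the dubbed line
```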
It is possible, of course, to add a human-in-the-loop post-editing step to perfect the segment-level transcription and the speaker diarization before feature vectors are extracted; these vectors are used for speaker adaptation when the target voice is synthesized.
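Feature-vector extraction for speaker adaptation typically means computing a fixed-size voice embedding per speaker from a few seconds of their clean speech. The sketch below uses the open-source Resemblyzer encoder as a stand-in; the choice of library is an assumption for illustration and not AppTek's component.

```python
# Illustrative speaker feature-vector extraction with the open-source Resemblyzer
# encoder (a stand-in, not AppTek's component). One embedding per diarized speaker
# can later condition voice-adapted speech synthesis.

from pathlib import Path
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

def speaker_embedding(clips: list[Path]) -> np.ndarray:
    """Average a fixed-size voice embedding over a speaker's clean speech clips."""
    embeds = [encoder.embed_utterance(preprocess_wav(clip)) for clip in clips]
    return np.mean(embeds, axis=0)

# e.g. a few seconds of clean, music-free speech per diarized speaker
spk1_vector = speaker_embedding([Path("spk1_clip1.wav"), Path("spk1_clip2.wav")])
```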
Next, a domain-adapted machine translation (MT) system, specializing in the type of language required as output, translates the transcribed segments into the target language.
“These MT systems also need to be enhanced with additional parameters to allow for dubbing-specific features that must be accommodated in the MT output,” Di Gangi pointed out.
The AppTek Lead Scientist enumerated several issues that need to be addressed in the MT part of the automatic-dubbing pipeline, such as how to…
- Use previous text and speaker ID information as additional context for better automatic translation of the current sentence;
- Produce automatic translations whose character-sequence length is similar to that of the source sentence, so as to achieve isochrony, a key dubbing requirement; in other words, a translation that can be uttered at a natural pace in the same amount of time as the source sentence (see the sketch after this list);
- Select the best translation length considering the global constraints in the translated document, rather than the local constraints of a single sentence, which would improve the viewing experience;
- Achieve prosody-awareness (crucial to all types of dubbing synchrony) by explicitly modeling speaker pauses.
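To make the isochrony point above concrete, one simple approach is to generate several translation candidates and pick the one whose estimated speaking time best fits the time available for the source segment. The sketch below scores candidates against a characters-per-second budget; the rate constant and the candidate list are illustrative assumptions, and in AppTek's pipeline the length constraint is handled inside the MT models themselves.

```python
# Simplified isochrony check: pick, from several MT candidates, the one whose
# estimated speaking time best matches the source segment's duration.
# The chars-per-second rate and the candidates are illustrative assumptions.

CHARS_PER_SECOND = 15.0  # rough average speaking rate for many European languages

def estimated_duration(text: str) -> float:
    return len(text) / CHARS_PER_SECOND

def pick_isochronous(candidates: list[str], source_duration: float) -> str:
    """Return the candidate whose estimated duration deviates least from the budget."""
    return min(candidates, key=lambda c: abs(estimated_duration(c) - source_duration))

candidates = [
    "No tengo ni la menor idea de lo que quieres decir.",
    "No sé qué quieres decir.",
    "No tengo idea de qué quieres decir.",
]
best = pick_isochronous(candidates, source_duration=1.8)  # ~1.8 s available
print(best)
```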
To inject some real-world context into translation and improve quality — around issues such as gender, register, translation length, and so on — AppTek has been working on using metadata to inform MT output, as Evgeny Matusov, AppTek’s Lead Science Architect for MT, explained in an interview.
A manual post-editing step (post-editing of machine translation, or PEMT) can again be applied to perfect the output before the target speech is synthesized. Once the text has been translated, segmented, and made ready for voicing, the next step can begin.
A general text-to-speech (TTS) approach is to train regular, single-speaker or multi-speaker models on a predefined set of voices, which can be used to generate synthetic speech.
A more sophisticated approach, known as “zero-shot multi-speaker TTS,” is to build TTS models capable of mimicking the voice from source audio without fine-tuning the model on the new speaker. Instead, speaker characteristics, extracted from a few seconds of speech, are used as input in the synthesis process.
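In practice, a zero-shot setup conditions the synthesizer on a reference clip (or an embedding derived from it) rather than on a fixed speaker ID. The snippet below uses the open-source Coqui XTTS model purely to illustrate the technique; it is not AppTek's TTS system, and the model name is an assumption taken from Coqui's public catalog.

```python
# Zero-shot multi-speaker TTS sketch using the open-source Coqui XTTS model as a
# stand-in (not AppTek's system). A few seconds of the original speaker's audio
# condition the synthesis; no fine-tuning on the new speaker is needed.

from TTS.api import TTS

# Model name assumed from Coqui's public catalog; weights download on first use.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="Willkommen zurück zu unserer Sendung.",   # translated segment
    speaker_wav="spk1_reference.wav",               # few seconds of the original speaker
    language="de",
    file_path="spk1_segment_de.wav",
)
```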
TTS models must also be able to reproduce various aspects of a source voice with precision, such as speaking rate and emotions.
According to AppTek’s Di Gangi, “We can also control other aspects of a voice, such as pitch and energy. This control can be passed on to a human-in-the-loop via SSML tagging, so corrections can be made to the TTS output as needed.”
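Where the TTS engine accepts SSML, that human-in-the-loop control can take the form of prosody tags wrapped around a segment's text. The snippet below builds such a tagged segment; the attribute values are examples, and which SSML features a given engine honors varies.

```python
# Illustrative SSML prosody control for a single dubbed segment. Attribute values
# are examples; support for specific SSML features varies by TTS engine.

def ssml_segment(text: str, rate: str = "medium", pitch: str = "+0%") -> str:
    """Wrap a translated segment in SSML prosody tags for rate and pitch correction."""
    return (
        "<speak>"
        f'<prosody rate="{rate}" pitch="{pitch}">{text}</prosody>'
        "</speak>"
    )

# e.g. speed up a line that would otherwise overrun its isochrony budget
print(ssml_segment("Willkommen zurück zu unserer Sendung.", rate="110%", pitch="-2%"))
```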
Di Gangi further explained that once the synthetic audio is ready, the residual audio extracted from the source is merged with the synthetic audio track to generate the final audio in the target language. The original dialogue track can also be added in the background at a lower volume, if needed, as is the case with UN-style voice-overs.
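That final mix can be approximated with standard audio tooling: overlay the synthetic dialogue on the residual (music and effects) stem and, optionally, keep the original dialogue underneath at reduced volume for a UN-style voice-over. Below is a minimal sketch with ffmpeg, which again stands in for AppTek's own rendering step; file names and the 25% ducking level are assumptions.

```python
# Final-mix sketch with ffmpeg (a stand-in for AppTek's rendering step): overlay the
# synthetic dialogue on the residual stem and, optionally, keep the original dialogue
# at low volume underneath for a UN-style voice-over.

import subprocess

def mix_and_mux(video: str, synth: str, residual: str, original: str, out: str) -> None:
    # Duck the original dialogue to 25% volume, then mix all three audio tracks.
    subprocess.run([
        "ffmpeg", "-y", "-i", synth, "-i", residual, "-i", original,
        "-filter_complex",
        "[2:a]volume=0.25[orig];[0:a][1:a][orig]amix=inputs=3:duration=longest[mix]",
        "-map", "[mix]", "mixed.wav",
    ], check=True)
    # Remux: keep the original video stream, attach the new target-language audio.
    subprocess.run([
        "ffmpeg", "-y", "-i", video, "-i", "mixed.wav",
        "-map", "0:v", "-map", "1:a", "-c:v", "copy", out,
    ], check=True)

mix_and_mux("episode.mp4", "synthetic_de.wav", "residual.wav",
            "episode_dialogue.wav", "episode_de.mp4")
```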
Near-Instant S2S
So how long does this entire automatic dubbing process take? According to AppTek’s Maddock, “For a completely automatic process, with no human-in-the-loop post-editing steps, the duration from start to finish can be even shorter than the video’s running time!”
The AppTek SVP added, “The studio-based professional services currently used by the market can take several weeks for lip sync dubbing and a few days for voice-over.”
Therefore, the almost-instant S2S delivery outlined here opens up the application of automatic dubbing to more audiovisual products, languages, and locales than ever before.
AppTek’s competitive advantage is that it includes all these technologies in a single stack. Furthermore, relevant scientific teams oversee and support the process and can collaborate with clients on a daily basis.
As Lead Scientist Di Gangi pointed out, “It isn’t easy to crack S2S, one of the hardest problems in natural language processing, by using siloed, third-party components thrown together to make an S2S pipeline.”
Learn more about AppTek Automatic Dubbing and schedule a demo today.