Korean Researchers Promise Live, Lip-Synced Speech Translation

In a March 26, 2024 paper, Jeongsoo Choi, Se Jin Park, Minsu Kim, and Yong Man Ro from the Korea Advanced Institute of Science & Technology (KAIST) introduced a novel framework for direct audio-visual speech to audio-visual speech translation (AV2AV), where both input and output are multimodal.

Specifically, the proposed AV2AV framework takes both audio and visual speech as inputs, translates the linguistic content, and generates both audio and visual speech outputs, providing users with a multimodal experience.

The authors noted that “multimodal (i.e., audio and visual) speech translation is in its very early stages,” with their work being the first to explore direct AV2AV, where inputs and outputs are both audio-visual.

What are the main advantages?

First, AV2AV offers synchronized lip movements along with the translated speech, simulating real face-to-face conversations and providing a more immersive dialogue experience. Second, it enhances the robustness of the spoken language translation system by leveraging complementary information from audio and visual speech, ensuring accurate translations even in the presence of acoustic noise.

Additionally, the authors suggest that an AV2AV approach provides a faster and more cost-effective solution for audio-visual speech translation compared to traditional 4-stage cascaded speech to audio-visual speech translation approaches, which involve a sequential process of automatic speech recognition (ASR), neural machine translation (NMT), text-to-speech synthesis (TTS), and audio-driven talking face generation (TFG).
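The structural difference between the two approaches can be sketched in a few lines of Python. The stage functions below are hypothetical string-returning stubs standing in for real models (they are not the authors' implementation); the point is only the data flow: the cascade chains four separate systems, so latency and errors accumulate stage by stage, while a direct model maps audio-visual input to audio-visual output in one step.

```python
# Toy sketch contrasting the 4-stage cascade with a direct AV2AV model.
# All stage functions are hypothetical placeholders, not real models.

def asr(audio):                      # automatic speech recognition
    return f"text({audio})"

def nmt(text, tgt_lang):             # neural machine translation
    return f"{tgt_lang}:{text}"

def tts(text):                       # text-to-speech synthesis
    return f"audio({text})"

def tfg(audio, face_video):          # audio-driven talking face generation
    return f"video({audio},{face_video})"

def cascaded_av_translation(audio, face_video, tgt_lang):
    """Sequential ASR -> NMT -> TTS -> TFG: each stage consumes
    the previous stage's output, so errors and latency compound."""
    text = asr(audio)
    translated = nmt(text, tgt_lang)
    new_audio = tts(translated)
    new_video = tfg(new_audio, face_video)
    return new_audio, new_video

def direct_av2av(audio, video, tgt_lang):
    """A single learned mapping from audio-visual input to
    audio-visual output (placeholder for the end-to-end model)."""
    return f"audio'({audio},{tgt_lang})", f"video'({video},{tgt_lang})"

print(cascaded_av_translation("src_audio", "src_face", "es"))
print(direct_av2av("src_audio", "src_face", "es"))
```

Because the cascade never sees the original lip movements until the final TFG stage, any recognition or translation error earlier in the chain propagates into both the generated audio and video, which is part of the motivation for the direct approach.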

Increased Demand and Effectiveness

The authors stressed that “in today’s world, where millions of multimedia content pieces are generated daily and shared globally in diverse languages, the demand for systems like the proposed AV2AV is anticipated to increase.”

However, developing a direct AV2AV system is challenging due to the lack of existing data for training. While text and speech datasets are abundant, there is a scarcity of parallel audio-visual speech data. “As there is no available AV2AV translation data, it is not feasible to train our model in a parallel AV2AV data setting,” they said.

They explained that one approach to address this challenge would be to generate this data artificially by creating speech and video separately. However, they acknowledged that this method may not yield optimal results due to limitations in accurately replicating lip movements. Instead, they demonstrated that the proposed AV2AV framework can be trained using audio-only data to facilitate translation between AV speech. 

Moreover, as the proposed AV2AV can be trained without using text data, the authors noted that the system can serve languages with no writing systems.

“The demand for systems like the proposed AV2AV is anticipated to increase.”

The effectiveness of AV2AV was validated through extensive experiments in a many-to-many language translation setting. Since no previous method could perform AV2AV, the authors compared its performance with the state-of-the-art direct audio-visual speech-to-speech translation model, AV-TranSpeech. The results showed that the proposed method is “much more effective” than AV-TranSpeech, especially in the low-resource setting.

A demo page showcasing the AV2AV system is available at choijeongsoo.github.io/av2av.