Google Presents the First On-Device, Real-Time Speech-to-Speech Translation Model

SimulTron On-Device Simultaneous S2ST

In a June 4, 2024, paper, researchers from Google Research and Google DeepMind presented SimulTron, a model designed for on-device, real-time speech-to-speech translation (S2ST) and built on the Translatotron architecture.

The researchers highlighted the ongoing evolution of S2ST technology while emphasizing the persistent challenge of accurate, real-time, on-device simultaneous translation. They noted that existing simultaneous translation models are not adequately optimized for the unique constraints of mobile devices, where compute, memory, and power are all limited.

“Today, with smartphones and tablets being central hubs for personal and professional interactions, on-device S2ST is crucial,” they said.

In response, they introduced SimulTron, a model that leverages the strengths of Translatotron while incorporating key modifications tailored specifically to on-device, simultaneous translation scenarios.

“By bringing real-time, simultaneous translation directly to mobile devices, we envision a future where language barriers are significantly reduced.” — Agranovich et al.

According to the researchers, “SimulTron establishes a milestone as the first method to demonstrate real-time S2ST on a device.”

To assess the effectiveness of SimulTron, they carried out a series of experiments focused on English-Spanish translation tasks using the MuST-C dataset. Evaluation metrics included BLEU scores for translation accuracy and a human evaluation using a standard 5-point scale to assess the naturalness of the translated speech.
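For readers unfamiliar with BLEU, the sketch below shows how the metric works at its core: modified n-gram precision combined with a brevity penalty. This is a simplified, illustrative implementation assuming whitespace tokenization, uniform n-gram weights, and no smoothing; it is not the evaluation tooling the researchers used.

```python
# Minimal sentence-level BLEU sketch (illustrative only, not the paper's
# evaluation code). Assumes whitespace tokenization and a single reference.
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precision = 0.0
    for n in range(1, max_n + 1):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum((c_counts & r_counts).values())
        total = max(sum(c_counts.values()), 1)
        if overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precision += math.log(overlap / total) / max_n
    # Brevity penalty discourages overly short translations.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_precision)

print(round(bleu("the cat is on the mat", "the cat is on the mat"), 2))  # → 1.0
```

Production evaluations typically use standardized tooling such as sacreBLEU so that scores are comparable across papers.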

The results of these experiments showcased SimulTron’s real-time translation capabilities on a Pixel 7 Pro Android phone, demonstrating its ability to operate effectively within the constraints of mobile hardware.

Furthermore, SimulTron outperformed existing real-time S2ST approaches on the MuST-C dataset in both BLEU score and latency, translating speech quickly while preserving its natural characteristics.

“By bringing real-time, simultaneous translation directly to mobile devices, we envision a future where language barriers are significantly reduced, fostering greater understanding and collaboration across cultures,” concluded the researchers.

Future research directions aim at expanding SimulTron’s multilingual capabilities, optimizing for various mobile hardware configurations, and exploring techniques to enhance translation quality under challenging acoustic conditions.

Authors: Alex Agranovich, Eliya Nachmani, Oleg Rybakov, Yifan Ding, Ye Jia, Nadav Bar, Heiga Zen, and Michelle Tadmor Ramanovich