Meet Open-Source GenTranslate, a Speech and Text Translator Built on Meta’s SeamlessM4T


In a February 10, 2024 paper, a group of researchers from institutions including Singapore’s Nanyang Technological University and chip giant NVIDIA introduced GenTranslate.

According to its creators, GenTranslate is a novel generative paradigm for translation tasks that leverages large language models (LLMs) to generate better results by considering diverse translation candidates and benefiting from the rich information they contain.

As the researchers explained, traditional speech translation (ST) and machine translation (MT) models employ the typical beam search algorithm and select the top-1 hypothesis as the final output.

This means that when provided with input speech or text in a source language, these models perform translation into a target language using beam search decoding, which generates a list of N-best hypotheses containing multiple potential translations. Subsequently, they select the most probable translation as the output (referred to as the top-1 hypothesis).

However, this method may discard valuable semantic information present in the broader range of alternative hypotheses (the 2nd- through Nth-best candidates), information that could enhance the accuracy of the generated translations. The researchers characterized the typical top-1 hypothesis selection as “sub-optimal.”
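
To make the beam search setup concrete, here is a hedged sketch using an off-the-shelf MarianMT model via Hugging Face transformers (not the paper’s SeamlessM4T foundation model); it shows how beam search exposes an N-best list of hypotheses, of which a conventional pipeline keeps only the first.

```python
# Illustration only: a generic MarianMT model stands in for the paper's
# foundation translation model. Beam search can return the full N-best
# list rather than just the top-1 hypothesis.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-de-en"  # example checkpoint for the demo
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

inputs = tokenizer("Maschinelle Übersetzung ist schwierig.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,              # beam width
    num_return_sequences=5,   # keep the whole N-best list, not only the best beam
    early_stopping=True,
)

n_best = [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]
top_1 = n_best[0]             # what a conventional pipeline would output
print(top_1)
print(n_best[1:])             # hypotheses 2..N that are normally discarded
```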

GenTranslate improves upon traditional beam search decoding and top-1 hypothesis selection in translation tasks. By employing LLMs, it considers diverse translation candidates (N-best hypotheses) to generate a single, high-quality translation. Specifically, the diverse N-best hypotheses generated by a foundation ST or MT model are fed into the LLM, which applies its linguistic knowledge and reasoning capabilities to these diverse translation versions, capturing the nuances and context of the input to produce a more accurate translation.

This approach ensures that the final translation result benefits from the rich information contained in the multiple translation versions in the N-best list. “We leverage LLMs to integrate the diverse translation versions in the N-best list to generate an informative and higher-quality translation result,” they said.
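
As a rough sketch of this idea, one could wrap the N-best list in a prompt and hand it to an LLM; the prompt wording, helper function, and example hypotheses below are illustrative assumptions, not the authors’ fine-tuned setup.

```python
# Minimal sketch of the GenTranslate idea: pass the whole N-best list to an
# LLM and ask for one integrated translation. Prompt text is an assumption.
def build_prompt(n_best: list[str], target_lang: str = "English") -> str:
    hypotheses = "\n".join(f"{i + 1}. {hyp}" for i, hyp in enumerate(n_best))
    return (
        f"Below are {len(n_best)} candidate translations into {target_lang}, "
        "produced by a translation model. Combine the information they contain "
        f"and output a single, more accurate {target_lang} translation.\n\n"
        f"{hypotheses}\n\nBest translation:"
    )

prompt = build_prompt([
    "Machine translation is difficult.",
    "Machine translation is hard.",
    "The machine translation is difficult.",
])
# The prompt would then be passed to a (fine-tuned) LLM, for example via a
# text-generation pipeline, and the LLM's reply taken as the final translation.
print(prompt)
```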

The researchers used SeamlessM4T, a multimodal model released by Meta in August 2023 that offers different combinations of text and speech translation for dozens of languages, as the foundational translation model for both ST and MT tasks within the GenTranslate system.
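
For a rough picture of the foundation model in this pipeline, the sketch below runs text-to-text translation with SeamlessM4T through its Hugging Face transformers integration; the checkpoint name and example sentence are illustrative assumptions rather than the paper’s exact setup.

```python
# Sketch: SeamlessM4T text-to-text translation via Hugging Face transformers.
# The same model also accepts speech input (ST).
from transformers import AutoProcessor, SeamlessM4TModel

checkpoint = "facebook/hf-seamless-m4t-medium"  # example checkpoint
processor = AutoProcessor.from_pretrained(checkpoint)
model = SeamlessM4TModel.from_pretrained(checkpoint)

text_inputs = processor(text="Maschinelle Übersetzung ist schwierig.",
                        src_lang="deu", return_tensors="pt")
output_tokens = model.generate(**text_inputs, tgt_lang="eng", generate_speech=False)
translation = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print(translation)
```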

To support LLM fine-tuning for GenTranslate, they released a new dataset, called HypoTranslate, containing nearly 0.6 million pairs of N-best hypotheses and ground-truth translations in 11 languages, providing a diverse set of examples for LLMs to learn from.

The model learns to align the N-best hypotheses with the correct translation by using the ground-truth translation as a reference point during training: the actual correct translation guides the model to generate accurate outputs from the diverse hypotheses it considers.
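
A hypothetical sketch of what one such training example might look like follows; the field names and values are illustrative assumptions, not the released HypoTranslate schema.

```python
# Hypothetical shape of one HypoTranslate-style training example.
example = {
    "source_lang": "de",
    "target_lang": "en",
    "n_best_hypotheses": [           # produced by the foundation ST/MT model
        "Machine translation is difficult.",
        "Machine translation is hard.",
        "The machine translation is difficult.",
    ],
    "ground_truth": "Machine translation is difficult.",  # supervision target
}

# During fine-tuning, the LLM is trained to map the N-best hypotheses
# (wrapped in a prompt) to the ground-truth translation.
```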

Effectiveness and Generality

GenTranslate showed improvements over various baselines across tasks (ST and MT), test datasets (FLEURS, WMT), and language directions (X→En and En→X), verifying the “effectiveness” and “generality” of the approach, according to the researchers. “Experiments on various ST and MT benchmarks show that our GenTranslate significantly outperforms the state-of-the-art model,” they said.

For speech translation, the researchers investigated both end-to-end ST and cascaded ASR+MT approaches. The performance of the GenTranslate model was evaluated on the FLEURS and CoVoST-2 datasets for translating from language X to English and from English to language X.

GenTranslate showed consistent improvements over strong baselines such as Whisper, AudioPaLM2, and SeamlessM4T-Large, including a significant gain over the best-performing of these, SeamlessM4T-Large. In addition, a comparison between the end-to-end ST and cascaded ASR+MT methods showed the cascaded system outperforming the end-to-end system.

For machine translation, evaluation was conducted on the FLORES dataset for X→English MT and on WMT test sets for English→X MT. GenTranslate achieved state-of-the-art performance with consistent gains in all language directions except Japanese→English, surpassing competitive baselines like ALMA, BigTranslate, and NLLB.

The researchers open-sourced their work on GitHub.

Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, Eng Siong Chng