Can Large Language Models Do Simultaneous Machine Translation?

On September 13, 2023, a team of researchers from Monash University released a paper, “Simultaneous Machine Translation with Large Language Models,” examining whether large language models (LLMs) can perform simultaneous machine translation, or, as the researchers call it, SimulMT.

The Monash team proposes a so-called “mixture policy” that enables LLMs to perform SimulMT without additional training. The researchers note that LLMs have shown competitive performance in offline machine translation (MT) tasks, especially for high-resource languages. However, according to the authors, there have been no successful attempts to apply LLMs to SimulMT, which involves translating text in real time as it is being received.

The authors state that “unlike offline translation, in SimulMT, the source text accumulates incrementally over time, and the translation model needs to provide translations incrementally and synchronously. During this process, the model requires to have a certain policy to make decisions on taking READ or WRITE actions.”

To address these challenges, the researchers drew on insights from conventional simultaneous translation models, combining a “wait-k” policy with incremental decoding to design a mixture policy tailored to LLMs.

The “wait-k” component of the mixture policy tells the model how much source text to read before it starts translating, while the “incremental decoding” component determines how much to translate before reading more, as sketched below.
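To make the mechanics concrete, here is a minimal Python sketch of such a wait-k loop with incremental decoding. It is an illustration under stated assumptions, not the authors’ implementation: `translate_step` is a hypothetical callback wrapping the LLM, and `dummy_step` is a toy stand-in so the example runs end to end.

```python
def wait_k_translate(source_tokens, translate_step, k=3, chunk_size=1):
    """READ until k source tokens are seen, then alternate one READ with one
    WRITE of up to `chunk_size` target tokens; flush the rest at the end."""
    read = []      # source prefix read so far
    written = []   # target tokens committed so far

    for token in source_tokens:
        read.append(token)          # READ action
        if len(read) < k:
            continue                # still inside the initial wait-k window
        # WRITE action: incremental decoding conditioned on the source prefix
        # and on the translation committed so far.
        written.extend(translate_step(read, written, chunk_size))

    # Source exhausted: decode whatever translation remains in one final pass.
    written.extend(translate_step(read, written, None))
    return written


def dummy_step(src_prefix, tgt_prefix, chunk_size):
    """Toy 'translator' that simply copies source tokens (demonstration only)."""
    remaining = src_prefix[len(tgt_prefix):]
    return remaining if chunk_size is None else remaining[:chunk_size]


print(wait_k_translate("the cat sat on the mat".split(), dummy_step, k=3))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

In an actual system, `translate_step` would prompt the LLM with the source prefix and the already-committed target prefix and return only the newly generated tokens.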

Simultaneous Fine-Tuning

While this policy enabled LLMs to perform simultaneous translation to some extent, the authors explored further enhancements through Simultaneous Fine-Tuning (SFT). They observed that the initial policy could partially mitigate issues like hallucinations caused by incomplete source context but noted instances where the model produced locally coherent yet incorrect translations.

To curb the model’s tendency to complete a truncated source sentence rather than translate only what it has seen, they created a dataset of prefix-to-prefix data, consisting of source sentences truncated to varying lengths. ChatGPT was used to generate the corresponding target prefixes, which were then added to the multilingual training set.
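As a rough illustration of how such prefix pairs could be constructed, consider the sketch below. The truncation ratios and the `chatgpt_translate` helper are assumptions for the example, not the paper’s exact recipe.

```python
def make_prefix_samples(source_sentence, chatgpt_translate, ratios=(0.25, 0.5, 0.75)):
    """Truncate one source sentence to several lengths and pair each prefix
    with a target prefix produced by the translation helper (e.g. ChatGPT)."""
    tokens = source_sentence.split()
    samples = []
    for ratio in ratios:
        cut = max(1, int(len(tokens) * ratio))
        src_prefix = " ".join(tokens[:cut])
        samples.append({
            "source": src_prefix,
            "target": chatgpt_translate(src_prefix),  # target prefix for the truncated source
        })
    return samples
```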

The evaluation was conducted on nine language pairs from the MuST-C dataset, all with English as the source language and drawn from TED talk speech data.

The training set contained between 100,000 and 200,000 samples per language pair, along with 2,000 test samples. Including the additional 9,000 prefix samples, the total training set amounted to 1.9 million samples.

Evaluation metrics included BLEU for translation quality and Length-Adaptive Average Lagging (LAAL) for latency, measured using the SimulEval toolkit. Llama-2-7B-chat was used as the LLM.
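For readers unfamiliar with the quality metric, corpus-level BLEU is commonly computed with the sacrebleu library, as in the generic example below. The paper reports its scores through SimulEval, so this is only a minimal illustration with made-up sentences, not its evaluation script.

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["Das Haus ist klein .", "Er liest ein Buch ."]
references = [["Das Haus ist klein .", "Er liest ein Buch ."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```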

The results showed that the mixture policy enabled LLMs to achieve their intrinsic offline translation performance during simultaneous decoding. After SFT, the models outperformed dedicated simultaneous translation models while maintaining lower latency. Adding the prefix training data brought slight further improvements in low-latency scenarios.

Looking ahead, the authors said, “In future work, we plan to validate this approach across a wider range of LLMs and languages and explore its integration with speech modalities.”

Authors: Minghan Wang, Jinming Zhao, Thuy-Trang Vu, Fatemeh Shiri, Ehsan Shareghi, Gholamreza Haffari