Fine-Tuned LLMs Are Good at Document-Level Machine Translation, Research Finds

Fine-Tuned LLMs and Machine Translation

In a January 12, 2024 paper, Minghao Wu, Thuy-Trang Vu, Lizhen Qu, and Gholamreza Haffari from Monash University, along with George Foster from Google, explored the adaptation of large language models (LLMs) for document-level machine translation (MT).

They fine-tuned and tested moderately-sized LLMs (with 7B parameters) across 18 translation tasks involving nine language pairs (English ↔ Arabic, German, French, Italian, Japanese, Korean, Dutch, Romanian, and Chinese), utilizing LLAMA2-7B, BLOOM-7B, and VICUNA-7B as backbones.

Furthermore, they compared these fine-tuned LLMs with state-of-the-art MT models, including NLLB and Google Translate, as well as state-of-the-art LLMs like GPT-3.5-TURBO and GPT-4-TURBO. To assess translation quality, they employed BLEU, SacreBLEU, and COMET metrics.
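For readers who want to run a similar evaluation, the snippet below is a minimal sketch assuming the sacrebleu and unbabel-comet Python packages and the publicly available Unbabel/wmt22-comet-da checkpoint; it is not the authors' evaluation code, and the example sentences are invented for illustration.

```python
# Minimal evaluation sketch (not the paper's code): scores system outputs with
# SacreBLEU and COMET. Requires `pip install sacrebleu unbabel-comet`.
import sacrebleu
from comet import download_model, load_from_checkpoint

sources = ["Der Vertrag wurde gestern unterzeichnet."]      # source sentences
hypotheses = ["The contract was signed yesterday."]         # system outputs
references = ["The agreement was signed yesterday."]        # human references

# Corpus-level BLEU as computed by SacreBLEU.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# COMET scores each segment from its source, hypothesis, and reference.
# "Unbabel/wmt22-comet-da" is one public checkpoint; the paper may use another.
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
comet_data = [
    {"src": s, "mt": h, "ref": r}
    for s, h, r in zip(sources, hypotheses, references)
]
comet_output = comet_model.predict(comet_data, batch_size=8, gpus=0)
print(f"COMET: {comet_output.system_score:.4f}")
```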

The authors underscored that, while previous studies focused on LLMs for document-level MT through prompting techniques, their study concentrated on analyzing the effectiveness of parameter-efficient fine-tuning (PEFT) and full fine-tuning (FFT) methods on moderately-sized LLMs in the context of document-level MT. 

PEFT updates only a small subset of the model's parameters, typically by training lightweight adapter modules (e.g., LoRA) while keeping the original weights frozen, which reduces memory and compute requirements. In contrast, FFT updates all of the model's parameters, giving the model the greatest freedom to adapt to the training data at a substantially higher computational cost.
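As a rough illustration of the difference, the sketch below wraps a 7B backbone with LoRA adapters, a widely used PEFT method, via the Hugging Face transformers and peft libraries; the choice of LoRA, the hyperparameter values, and the Llama-2 checkpoint name are illustrative assumptions rather than the paper's exact configuration.

```python
# Illustrative PEFT-vs-FFT setup (assumed configuration, not the paper's recipe).
# Requires `pip install transformers peft` and access to the Llama-2 weights.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # one 7B-scale backbone, used here as an example
model = AutoModelForCausalLM.from_pretrained(model_name)

# FFT would update every parameter of the backbone.
total_params = sum(p.numel() for p in model.parameters())
print(f"FFT trainable parameters: {total_params:,}")

# PEFT (here, LoRA) freezes the backbone and trains only small adapter matrices
# injected into the attention projections.
lora_config = LoraConfig(
    r=16,                                  # adapter rank (illustrative value)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # modules commonly adapted in LLaMA-style models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()    # typically well under 1% of the total
```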

The authors fine-tuned the three moderately-sized LLMs with both strategies and compared the results to understand the impact of each on overall translation quality and data efficiency in document-level MT tasks. They followed a two-stage training strategy: initial fine-tuning on monolingual text, followed by a second fine-tuning phase on parallel text.
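In practice, the two stages differ mainly in how training examples are constructed: plain documents for the monolingual stage and source-target document pairs for the parallel stage. The sketch below shows one plausible formatting scheme; the templates are assumptions for illustration, not the prompts used in the paper.

```python
# Hypothetical example formatting for the two training stages
# (the templates below are assumptions; the paper's actual formats may differ).

def stage1_example(document: str) -> str:
    """Stage 1: plain monolingual text for continued language-model training."""
    return document

def stage2_example(src_doc: str, tgt_doc: str, src_lang: str, tgt_lang: str) -> str:
    """Stage 2: a parallel document pair rendered as a prompt-target string."""
    return (
        f"Translate the following document from {src_lang} to {tgt_lang}.\n"
        f"{src_lang}: {src_doc}\n"
        f"{tgt_lang}: {tgt_doc}"
    )

print(stage2_example(
    "Der Vertrag wurde gestern unterzeichnet. Beide Seiten sind zufrieden.",
    "The contract was signed yesterday. Both sides are satisfied.",
    "German",
    "English",
))
```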

According to the authors, “this comprehensive study aims to advance understanding and improve the efficacy of LLMs in document-level MT tasks.”

Overall Performance and Off-Target Translations

They found that GPT-4-TURBO and GPT-3.5-TURBO outperformed all other models. However, when translating from other languages into English, the moderately-sized LLMs in some cases exhibited superior translation performance, even surpassing GPT-4-TURBO.

Nevertheless, in other translation directions they suffered significantly from off-target translation issues, where the output is produced in a language other than the intended target language, despite being fine-tuned exclusively on bilingual corpora for those languages.
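Off-target rates are typically measured by running automatic language identification over system outputs. The sketch below is a minimal example of that idea, assuming the langid package; the paper may quantify off-target translations differently.

```python
# Minimal off-target detection sketch (assumed approach, not the paper's method).
# Requires `pip install langid`.
import langid

def off_target_rate(translations: list[str], target_lang: str) -> float:
    """Fraction of outputs whose detected language differs from the intended target."""
    if not translations:
        return 0.0
    off_target = 0
    for text in translations:
        detected_lang, _score = langid.classify(text)
        if detected_lang != target_lang:
            off_target += 1
    return off_target / len(translations)

# Both outputs below would be flagged when the intended target language is German ("de").
outputs = ["Le contrat a été signé hier.", "The contract was signed yesterday."]
print(f"Off-target rate: {off_target_rate(outputs, 'de'):.2%}")
```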

The study also provided an in-depth analysis of the translation error distribution, shedding light on the strengths and limitations of LLM-based document-level MT models. The authors noted that, at similar levels of overall performance, the LLM-based document-level MT models exhibited fewer context-independent and context-dependent errors. “Fine-tuning LLMs for machine translation is a promising research direction, particularly for improving document-level translation quality,” they said.

The PEFT approach demonstrated superior overall performance compared to the FFT approach. However, the FFT approach showed better data efficiency, requiring only about 1% of the total dataset to match the performance of models trained on the entire training set, while the PEFT approach required 10% of the total dataset to achieve comparable results.

Prompting Methods

The authors highlighted the significant role of prompting methods in fine-tuning, aiming to address two research questions: how does the context structure affect translation quality, and how do natural language instructions affect translation quality?

They emphasized that prompts play a significant role in LLM performance, but their effectiveness can vary across different models. Specifically, they found that a well-structured prompt that combines an appropriate context structure with extra contextual information and natural language instructions can significantly boost model performance. However, natural language instructions are less effective when using instruction-tuned language models as model backbones. 
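To make the distinction concrete, the sketch below shows three hypothetical prompt variants for document-level MT, ranging from a bare sentence to a prompt that combines preceding context with a natural language instruction; the templates are illustrative assumptions, not the exact prompts evaluated in the paper.

```python
# Hypothetical prompt variants for document-level MT (illustrative templates,
# not the exact prompts studied in the paper).

def prompt_sentence_only(src: str, src_lang: str, tgt_lang: str) -> str:
    """No document context and no instruction: just the sentence to translate."""
    return f"{src_lang}: {src}\n{tgt_lang}:"

def prompt_with_context(src: str, context: list[str], src_lang: str, tgt_lang: str) -> str:
    """Adds preceding source sentences as extra contextual information."""
    return f"Context: {' '.join(context)}\n{src_lang}: {src}\n{tgt_lang}:"

def prompt_with_instruction(src: str, context: list[str], src_lang: str, tgt_lang: str) -> str:
    """Combines the context structure with a natural language instruction."""
    return (
        f"Translate the next {src_lang} sentence into {tgt_lang}, "
        f"using the preceding context to resolve pronouns and terminology.\n"
        f"Context: {' '.join(context)}\n{src_lang}: {src}\n{tgt_lang}:"
    )
```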

Their analysis also revealed that LLM-based document-level MT models can handle data that differs from the training domain, showing promise for translating out-of-domain text. Additionally, they investigated whether the translation capability acquired from one language pair can be transferred to other language pairs. They found that, during fine-tuning on parallel documents, LLMs are more likely to activate their inherent translation capabilities than to develop new translation skills.

The authors concluded that “the findings of this research not only shed light on the strengths and limitations of LLM-based document-level MT models but also provide a foundation for future research in document-level MT.”