In Machine Translation, Gold-Standard Translations Are Not Always Gold


In a February 2, 2024 paper, machine translation researchers from Johns Hopkins University and Microsoft emphasized that gold-standard translations are not always “gold”. 

They introduced a new fine-tuning approach called contrastive preference optimization (CPO) that aims to help models avoid generating near-perfect yet flawed translations by using carefully curated preference data. 

CPO is a “more efficient variant” of direct preference optimization (DPO) that integrates preference learning into the training process. The researchers suggested that implementing CPO could significantly enhance the performance of moderate-sized large language models (LLMs) in machine translation (MT), matching or even surpassing the capabilities of GPT-4.
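Conceptually, CPO combines a DPO-style preference term with a likelihood term on the preferred translation, and drops DPO's frozen reference model. The sketch below is our reading of that idea, not the authors' code; the function and variable names are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cpo_loss(logp_preferred, logp_dispreferred, beta=0.1):
    """Sketch of a CPO-style objective for one preference pair.

    logp_preferred / logp_dispreferred: total log-probability the model
    assigns to the preferred and dis-preferred translation of the same
    source sentence.
    """
    # Preference term: a DPO-like contrastive loss, but without the
    # frozen reference model -- dropping that model is what makes CPO
    # cheaper in memory and faster than DPO.
    prefer = -math.log(sigmoid(beta * (logp_preferred - logp_dispreferred)))
    # Likelihood term: keep the probability of the preferred
    # translation high, anchoring the model to good outputs.
    nll = -logp_preferred
    return prefer + nll
```

The loss shrinks as the model widens the gap between the preferred and dis-preferred translations, while the likelihood term keeps it from drifting away from good outputs altogether.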

They explained that CPO addresses two main issues with traditional supervised fine-tuning (SFT) methods, pushing “the performance boundary of models that have reached saturation through SFT training.” 

Firstly, SFT focuses on making model outputs match reference translations, thus potentially limiting the model’s performance to the quality of the training data, which might not always be perfect. “Even human-written data, traditionally considered high-quality, is not immune to quality issues,” they said. Their analysis of the FLORES-200 dataset revealed instances where the quality of human-written parallel data was even inferior to that of system-generated translations. This finding led them to question the effectiveness of training models solely based on replicating reference translations.

Secondly, SFT lacks a mechanism to prevent the model from making its own mistakes. Sometimes, even though a translation may seem good, it might contain small errors like missing words, they explained. CPO helps address these problems by training the model to avoid producing near-perfect but ultimately flawed translations, leading to significant enhancements in translation performance, surpassing the capabilities of traditional SFT methods.

High-Quality Preference Dataset

CPO requires access to labeled preference data, yet such data is scarce in MT. To facilitate the implementation of CPO, the researchers built and released a high-quality preference dataset for ten language pairs: English to and from German, Czech, Icelandic, Chinese, and Russian.

This dataset, derived from the FLORES-200 dataset, includes three translations per source sentence: the original target reference, a translation from GPT-4, and a translation from ALMA. The highest-scoring translation is labeled as preferred, while the lowest-scoring translation is labeled as dis-preferred. “This approach of using high-quality but not flawless translations as dis-preferred data aids in training the model to refine details and achieve perfection in generated translations,” they explained.
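The construction step amounts to scoring each candidate translation and keeping the extremes as a preference pair. A minimal sketch, in which the `score` function stands in for the quality-estimation models used in the paper and the field names are our own:

```python
def build_preference_pair(source, candidates, score):
    """Given a source sentence and several candidate translations
    (e.g. the reference, a GPT-4 output, and an ALMA output), label
    the highest-scoring one as preferred and the lowest-scoring one
    as dis-preferred.
    """
    ranked = sorted(candidates, key=score, reverse=True)
    return {
        "source": source,
        "preferred": ranked[0],      # highest-scoring translation
        "dispreferred": ranked[-1],  # lowest-scoring translation
    }

# Toy usage with a fake scorer (here simply favoring longer output);
# a real pipeline would plug in a learned quality-estimation model.
pair = build_preference_pair(
    "Guten Morgen",
    ["Good morning", "Morning", "Good morning to you"],
    score=len,
)
```

Note that even the middle-ranked candidate is discarded: only the best and worst translations of each source sentence enter the preference data.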

Significant Advancement

The researchers further fine-tuned ALMA-13B-LoRA (Advanced Language Model-based trAnslator), an LLM released in 2023 and “one of the top moderate-size language-model based translation systems,” which surpasses even larger models such as GPT-3.5 and conventional systems such as NLLB-54B.

They compared the new fine-tuned model, named ALMA-13B-R, against other recently released 13B LLM-based models, as well as top-performing translation systems like GPT-4 and TowerInstruct.

The results demonstrated that ALMA-13B-R matched or even outperformed these advanced translation models, showing that applying CPO to fine-tune ALMA-13B-LoRA significantly enhances the model’s capabilities, bringing its performance to a level that equals or even surpasses that of GPT-4. For the evaluation, they used wmt23-cometkiwi-da-xxl, XCOMET-XXL, and wmt22-cometkiwi-da.

Finally, the researchers noted that CPO not only improves translation capabilities but also offers advantages in memory efficiency and speed, concluding that this marks “a significant advancement in the field of MT.”

Authors: Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, Young Jin Kim