How To Improve Post-Editing Capabilities of Large Language Models

In an April 11, 2024 paper, Dayeon Ki and Marine Carpuat from the University of Maryland demonstrated that guiding large language models (LLMs) with external feedback on quality improves their machine translation post-editing (MTPE) capabilities.

The authors acknowledged that previous studies have explored applying LLMs to automatic post-editing of machine translation (PEMT). However, their work differs in three ways.

First, they used external feedback to guide LLMs in refining translations, rather than relying on feedback the model generates itself. Second, they addressed post-editing of outputs from any model, rather than only improving the LLM’s own translations. Third, they worked with the open-source LLaMA-2 instead of the largest closed LLMs such as GPT-3.5, GPT-4, or PaLM-2.

Specifically, they experimented with the 7B and 13B variants, arguing that “it is […] worth exploring to what extent LLMs of more moderate size (e.g., 7B, 13B) can perform post-editing, as such models are less costly to train, run, and deploy in actual applications.” Additionally, they highlighted that working with open models facilitates the reproducibility of results and encourages others to build on this work.

Ki and Carpuat considered two strategies for guiding language models to edit MT using error annotations: prompting and fine-tuning with instructions.

First, they prompted LLaMA-2 using different forms of feedback with different levels of granularity: 

  • generic feedback: The model is simply prompted to improve the initial translation, with no external guidance on what is wrong.
  • score-based feedback: A single scalar MQM score (from 0 to 100) reflecting the overall quality of the initial translation is provided to the model, signaling how much improvement is needed.
  • fine-grained feedback: The most detailed level, where specific error annotations are provided to the model. These annotations can include error spans, error types, and severity levels, and can be produced either by human annotators or by automatic annotation tools.
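
As a rough illustration, the three feedback levels could be turned into prompts as in the Python sketch below. The prompt wording, the ErrorAnnotation class, its field names, and the build_prompt function are all hypothetical; they are not the templates or code used by the authors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ErrorAnnotation:
    """One MQM-style error annotation (field names are illustrative)."""
    span: str       # erroneous text in the initial translation
    category: str   # error type, e.g. "accuracy/mistranslation"
    severity: str   # e.g. "minor" or "major"


def build_prompt(
    source: str,
    translation: str,
    score: Optional[float] = None,
    errors: Optional[List[ErrorAnnotation]] = None,
) -> str:
    """Build a post-editing prompt at one of three feedback granularities.

    The wording is a hypothetical template, not the one used in the paper.
    """
    lines = [f"Source: {source}", f"Translation: {translation}"]
    if errors:
        # Fine-grained feedback: list each annotated error.
        lines.append("Feedback:")
        for e in errors:
            lines.append(f'- "{e.span}" is a {e.severity} {e.category} error')
    elif score is not None:
        # Score-based feedback: a single scalar quality score.
        lines.append(f"Feedback: MQM score {score:g} out of 100")
    # Generic feedback adds nothing beyond the instruction below.
    lines.append("Improve the translation.")
    return "\n".join(lines)


if __name__ == "__main__":
    annotation = ErrorAnnotation("night", "accuracy/mistranslation", "major")
    print(build_prompt("Guten Tag", "Good night", errors=[annotation]))
```

Calling build_prompt with neither a score nor annotations yields the generic variant, so the same function covers all three levels.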

Specifically, they considered three different sources of error annotation: (i) human annotations from the MQM WMT22 dataset, (ii) automatic annotations generated by InstructScore, an explainable text generation evaluation metric that fine-tunes LLaMA to predict fine-grained error annotations in the MQM style, and (iii) automatic annotation provided by xCOMET, an automatic evaluation and quality estimation tool that fine-tunes XLM-RoBERTa to predict both MQM and Direct Assessment annotations of MT quality.

Improved Quality and Post-Editing Effort

Focusing on three language pairs (Chinese-English, English-German, and English-Russian), they found that prompting LLMs to edit MT with feedback consistently improved translation quality and reduced post-editing effort.

Ki and Carpuat noted, though, that fine-grained feedback on errors appeared to offer limited benefits over generic feedback, while score-based feedback yielded the least improvement in the MT output.

Extra Performance Boost and Natural Outputs

Next, they fine-tuned LLaMA-2 with fine-grained error annotations and found that fine-tuning gives an “extra boost in performance.” 

They noted that while prompting experiments did not show a clear preference for specific levels of feedback granularity, fine-tuning with fine-grained feedback consistently resulted in higher translation quality compared to fine-tuning with generic feedback. “This shows that fine-tuning allows the models to take advantage of the fine-grained feedback more effectively,” they said. 
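
To make the idea of fine-tuning with instructions concrete, the sketch below packs a source sentence, an initial translation, fine-grained feedback, and a corrected translation into one training record and serializes records as JSON Lines. The instruction/input/output field names, the instruction wording, and the make_record and to_jsonl helpers are assumptions for illustration; the authors' actual data format may differ.

```python
import json
from typing import Dict, List


def make_record(
    source: str, translation: str, feedback: str, post_edited: str
) -> Dict[str, str]:
    """Build one instruction-tuning example (hypothetical field names)."""
    return {
        "instruction": "Improve the translation based on the feedback.",
        "input": (
            f"Source: {source}\n"
            f"Translation: {translation}\n"
            f"Feedback: {feedback}"
        ),
        # Target text the model should learn to produce.
        "output": post_edited,
    }


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    record = make_record(
        "Guten Tag",
        "Good night",
        '"night" is a major accuracy/mistranslation error',
        "Good day",
    )
    print(to_jsonl([record]))
```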

Additionally, human evaluation showed that the fine-tuned models can not only fix targeted errors but also enhance naturalness in the target language. “Our analysis reveals that prompting the fine-tuned LLMs with fine-grained feedback not only helps fix the errors highlighted in the prompt, but also leads to more natural outputs,” they said.

Ki and Carpuat concluded that “these results clearly show that post-editing MT output does not require the largest proprietary LLM models and can be done with smaller open-source models.”

They plan to further explore how to create a workflow that can automatically assess any MT input, decide whether post-editing is necessary and how it should be done, and determine the most suitable feedback mechanism to use. They also want to explore how to minimize the reliance on human annotations, “which are expensive to obtain at scale.”

Ki and Carpuat released their code, dataset, and model checkpoints on GitHub.