Are Large Language Models Good at Translation Post-Editing?

Are Large Language Models Expert Translation Post-Editors

A team of researchers from Microsoft Azure AI explored the potential of applying large language models (LLMs) in automatic post-editing machine translation (PEMT) across various language pairs. The researchers published their findings on October 23, 2023.

The study assessed if LLMs improve the quality of machine-generated translations through meaningful edits, effectively rectifying translation errors and enhancing overall translation quality.

The researchers explored the efficacy of LLMs in the context of post-editing (PE) operating in a direct setting. This entailed the absence of any quality-estimation or error detection steps applied to the translations prior to post-editing.

A range of language pairs were tested, including English-Chinese, English-German, Chinese-English, and German-English. 

GPT-4 and GPT-3.5-turbo were selected as the LLMs of choice due to their advanced capabilities, acknowledged as the “the most capable publically available LLMs.”

According to researchers, this marks “the first work that investigates using GPT-4 for automatic post-editing of translations.”

Leveraging advanced language models like GPT-4 for this task could help detect and correct errors in machine-generated translations, enhancing their reliability. “LLM based automatic translation post-editing could aid in both detecting and fixing translation errors to ensure greater reliability of MT outputs,” the researchers said.

They also underlined that LLMs’ multilingual capabilities make them suitable for automatic post-editing tasks because beyond error correction they can also apply knowledge-based or culture-specific customizations to translations.

Their methodology involved using a prompt that framed the system’s role as a translation post-editor, instructing the LLM to propose improvements to the provided translation of a given source, before generating the final post-edited translation. The quality assessment involved a comparison between post-edited translations and the initial translations using reference-free and reference-based machine translation quality metrics. 

The study revealed that GPT-4 is “adept” at post-editing. It not only produced meaningful and trustworthy edits but also removed different types of major errors in the translated text. 


A crucial aspect of LLM automatic post-editing was the “trustworthiness” of the proposed edits made by the model. The researchers highlighted that “the fidelity of the proposed edits is important for imparting more trust in the LLM based post-editing process.”

To evaluate this, they employed an Edit Realization Rate (ERR) metric, measuring the extent to which the LLM’s suggested modifications were effectively incorporated into the final improved translation. The more often the LLM’s suggested changes were accurately incorporated into the final translation, the higher the trustworthiness and reliability of the LLM.

Since there was no ground truth data to quantify this, they used human evaluation to measure this property. These evaluations confirmed GPT-4’s capability to produce more trustworthy edits, making it a reliable choice for automatic post-editing. The researchers said that “GPT-4 could aid in automatic post-editing with considerably greater interpretability.”

However, the occasional occurrence of hallucinated edits prevents them from asserting that GPT-4 should be considered an expert post-editor. “GPT-4 could produce hallucinated edits, thereby urging caution in its use as an expert translation post-editor,” they said. They also noted that “GPT-4 might present similar reliability challenges as a post-editor as NMT does in translation.”

Authors: Vikas Raunak, Amr Sharaf, Yiren Wang, Hany Hassan Awadalla, and Arul Menezes