Researchers Improve AI Translation by Having Translators Give ‘Light-Weight’ Feedback

Translation Feedback Editing

In April 2024, Dayeon Ki and Marine Carpuat from the University of Maryland demonstrated that guiding large language models (LLMs) with external feedback on quality improves their machine translation post-editing (MTPE) capabilities.

In their study, they used feedback of varying granularity: generic, score-based, and fine-grained, with fine-grained feedback showing the greatest promise.

In a June 4, 2024 paper, Nathaniel Berger and Stefan Riezler from Heidelberg University, along with Miriam Exel and Matthias Huck from SAP SE, suggested that “light-weight” feedback is sufficient to effectively guide LLMs in self-correcting translations, even in technical domains where LLMs may still lag behind their performance in general domains.

They proposed a two-step process that leverages human feedback (i.e., error markings) to enhance the capabilities of LLMs. 

During the first step, translators identify and mark errors in the machine-generated translations. Erroneous spans are enclosed in <bad> </bad> tags; the markings carry no information about the type or severity of the errors.
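The marking step described above can be pictured with a minimal sketch. The `mark_errors` helper and the character-offset interface are illustrative assumptions, not code from the paper:

```python
def mark_errors(hypothesis: str, spans) -> str:
    """Wrap annotated error spans in <bad> </bad> tags.

    `spans` are (start, end) character offsets into the hypothesis,
    assumed non-overlapping and sorted (a hypothetical helper, not
    the authors' annotation tooling).
    """
    out, prev = [], 0
    for start, end in spans:
        out.append(hypothesis[prev:start])            # unmarked text before the error
        out.append(f"<bad>{hypothesis[start:end]}</bad>")  # the marked error span
        prev = end
    out.append(hypothesis[prev:])                     # trailing unmarked text
    return "".join(out)

# Toy example: a translator flags a mistranslated term.
hyp = "Klicken Sie auf die Taste Speichern."
print(mark_errors(hyp, [(20, 25)]))
# → Klicken Sie auf die <bad>Taste</bad> Speichern.
```

Note that the tags say only *where* an error is, not what kind of error it is or how severe.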

These error-marked segments are then used to prompt the LLMs, guiding them to focus on correcting the marked errors by referencing similar examples from a post-editing translation memory (PE-TM).

A PE-TM consists of source segments, machine translations, and reference translations, enriched by lightweight human error markings on the machine translations. By providing the LLM with instances where errors have been correctly identified and corrected, the LLM can learn from these examples and apply similar corrections to its own translations.
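A PE-TM entry and its rendering as a few-shot demonstration might look like the following minimal sketch. The field names and the `format_example` layout are assumptions for illustration, not the paper's actual data format:

```python
from dataclasses import dataclass

@dataclass
class PETMEntry:
    """One post-editing translation memory entry (illustrative fields)."""
    source: str      # English source segment
    hypothesis: str  # machine translation with <bad> </bad> error markings
    reference: str   # human reference (corrected) translation

# A toy entry; the marked span flags a mistranslated term.
entry = PETMEntry(
    source="Click the Save button to store your changes.",
    hypothesis="Klicken Sie auf die <bad>Taste</bad> Speichern, "
               "um Ihre Änderungen zu speichern.",
    reference="Klicken Sie auf die Schaltfläche Speichern, "
              "um Ihre Änderungen zu speichern.",
)

def format_example(e: PETMEntry) -> str:
    """Render one few-shot demonstration for the prompt."""
    return (
        f"English: {e.source}\n"
        f"German hypothesis: {e.hypothesis}\n"
        f"Corrected German: {e.reference}\n"
    )

print(format_example(entry))
```

Each demonstration shows the model a marked error alongside its correction, which is what lets it generalize the pattern to new marked segments.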

To test the effectiveness of this process, they conducted a pilot study in the IT domain for the English-German language pair. First, to create the PE-TM, they used data from open-source software documentation annotated by professional translators. They then employed Llama 13B and GPT-3.5 to generate and correct translations.

They considered three machine translation tasks: machine translation from scratch, automatic post-editing, and post-editing with error markings.

In the first scenario, models were prompted simply to translate the text. In the second, models read the original text and the translation hypothesis and then corrected the output. The third scenario was the same as the second, except that models corrected the output using the provided error markings.

Prompt: Read the English text and the German translation hypothesis and then correct the output. Incorrect words are inside of tags '<bad> </bad>'. Please use this feedback in your correction. If the hypothesis is already correct, do not make any changes.

The researchers noted that giving the error markings as in-line tags would be easier for the model to parse and integrate into its output than including another line where errors would be indicated further away from the corresponding tokens.
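Putting the pieces together, prompt assembly for the third scenario might look like this minimal sketch, with the error tags kept in-line next to the offending tokens. The `build_prompt` helper and the exact layout of the few-shot examples are assumptions, not the authors' code:

```python
# Instruction text as reported in the article's quoted prompt.
INSTRUCTION = (
    "Read the English text and the German translation hypothesis and then "
    "correct the output. Incorrect words are inside of tags '<bad> </bad>'. "
    "Please use this feedback in your correction. If the hypothesis is "
    "already correct, do not make any changes."
)

def build_prompt(examples, source, marked_hypothesis):
    """Assemble the instruction, few-shot PE-TM demonstrations, and the
    query segment whose errors are marked in-line (hypothetical helper).

    `examples` is a list of (source, marked_hypothesis, reference) triples
    drawn from the PE-TM.
    """
    parts = [INSTRUCTION, ""]
    for src, hyp, ref in examples:
        parts += [
            f"English: {src}",
            f"German hypothesis: {hyp}",
            f"Corrected German: {ref}",
            "",
        ]
    # The query segment: the model continues after "Corrected German:".
    parts += [
        f"English: {source}",
        f"German hypothesis: {marked_hypothesis}",
        "Corrected German:",
    ]
    return "\n".join(parts)

prompt = build_prompt(
    examples=[],
    source="Open the settings menu.",
    marked_hypothesis="Öffnen Sie das <bad>Einstellung</bad> Menü.",
)
print(prompt)
```

Because the tags sit directly beside the tokens they flag, the model does not need to align a separate list of error positions with the hypothesis, which is the parsing advantage the researchers describe.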

They found that providing error markings significantly improved the LLM’s ability to correct translations. The approach outperformed translation from scratch and automatic post-editing. “Overall translation quality is improved over few-shot prompt-based translation and over automatic post-editing,” they said.

Additionally, they found that the LLM that produced the translation hypotheses identified its own translations as correct, and therefore did not act on the instructions to correct errors. When prompted with error markings, however, the LLM learned to act on them: 68% of its edits were correct according to human evaluation, compared to 32% for automatic post-editing.