Study Shows Improved Quality Estimation via Fine-Tuning LLMs with Post-Editing Data

In a research paper published on August 10, 2023, Serge Gladkoff and Gleb Erofeev from Logrus Global, along with Lifeng Han and Goran Nenadic from the University of Manchester, examined whether state-of-the-art large language models (LLMs) can be fine-tuned to identify errors in machine translation (MT) output and predict translation quality.

As the authors explained, modern translation projects often involve post-editing of machine translation (PEMT), wherein the MT output undergoes review and revision by post-editors to eliminate errors and ensure alignment with customer requirements.

Quality estimation (QE) of MT has long been a critical topic for PEMT because it validates MT output without reference translations, which is essential when none are available.

The authors also highlighted that even for languages that MT does not handle well, a significant portion of segments (varying from 10% to 70%) remains unchanged during PEMT.

The authors proposed the application of machine learning (ML) methods to identify such segments. “This could further speed up the translation process and decrease the costs while preserving the premium quality of the translated product,” they said.
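To make the idea of "unchanged segments" concrete, here is a minimal sketch of how that share could be measured on historical post-editing data. The function name, the whitespace-insensitive comparison, and the sample segments are assumptions made for illustration; they are not taken from the paper.

```python
def unchanged_share(segments):
    """Return the fraction of (MT output, post-edited text) pairs that are identical.

    Assumption: a segment counts as unchanged when both strings match
    after stripping leading and trailing whitespace.
    """
    if not segments:
        return 0.0
    unchanged = sum(1 for mt, post_edited in segments if mt.strip() == post_edited.strip())
    return unchanged / len(segments)


# Hypothetical (MT output, post-edited text) pairs, invented for this example.
history = [
    ("Das ist ein Test.", "Das ist ein Test."),
    ("Er hat das Buch gelest.", "Er hat das Buch gelesen."),
    ("Guten Morgen.", "Guten Morgen."),
    ("Wie geht es dich?", "Wie geht es dir?"),
]

print(f"{unchanged_share(history):.0%} of segments were left as is")
```

On this toy data, two of the four segments were not touched by the post-editor, so the share is 50%.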

For their study, they chose ChatGPT as one of the state-of-the-art LLMs and fine-tuned it using OpenAI’s API and historical post-editing data for English-Italian and English-German. The model learned from triples of inputs: the original English source, the MT output, and the human-edited reference translation. The custom fine-tuned model, named GPT4QE (GPT for QE), was created in the authors’ private space on their OpenAI account. The authors emphasized that they did not use prompt engineering or edit distance training, but instead allowed the model to learn from the context and content of the training data.
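The triple structure described above could be represented roughly as follows. This is only a sketch: the class name, the field names, and the JSON Lines layout are hypothetical, and the article does not say how the authors actually serialized their data for OpenAI's fine-tuning API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PostEditTriple:
    """One training example; field names are assumptions, not the authors' schema."""
    source: str        # original English segment
    mt_output: str     # raw machine translation
    post_edited: str   # human-edited reference translation


def to_jsonl(triples):
    """Serialize triples as one JSON object per line (a generic layout, not an official one)."""
    return "\n".join(json.dumps(asdict(t), ensure_ascii=False) for t in triples)


# Hypothetical English-German example, invented for this sketch.
examples = [
    PostEditTriple(
        source="He has read the book.",
        mt_output="Er hat das Buch gelest.",
        post_edited="Er hat das Buch gelesen.",
    ),
]

print(to_jsonl(examples))
```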

Challenges and Implementation Strategies

The results of the study demonstrated that GPT4QE is able to make significant predictions about whether a translation segment requires post-editing and achieve a relatively high score on predicting translation quality. “To the best of our knowledge, this work is the first (or among the first) to investigate the ChatGPT LLM model on such MT error prediction tasks with positive outcomes,” said the authors.

However, a challenge arises when segments are predicted as not requiring post-editing but actually do need it. Such incorrectly predicted “LAI” (leave as is) segments present a significant issue for quality assurance, highlighting the complexities of implementing artificial intelligence (AI) predictions in production workflows.
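One way to quantify this risk is to tally predictions against what post-editors actually did. The sketch below is an illustration under stated assumptions: the category names, the helper functions, and the sample data are invented here, and the article does not report which metrics the authors used.

```python
from collections import Counter


def tally(predicted_lai, actually_unchanged):
    """Count prediction outcomes for parallel lists of booleans.

    predicted_lai[i]      -- True if the model said "leave as is"
    actually_unchanged[i] -- True if the post-editor made no changes
    """
    if len(predicted_lai) != len(actually_unchanged):
        raise ValueError("both lists must have the same length")
    counts = Counter()
    for predicted, actual in zip(predicted_lai, actually_unchanged):
        if predicted and actual:
            counts["correct_lai"] += 1
        elif predicted and not actual:
            counts["missed_edit"] += 1      # the risky case: an error would slip through
        elif not predicted and actual:
            counts["unneeded_review"] += 1  # costs time, but no quality risk
        else:
            counts["correct_edit"] += 1
    return counts


def missed_edit_rate(counts):
    """Share of predicted LAI segments that in fact needed editing."""
    predicted_lai_total = counts["correct_lai"] + counts["missed_edit"]
    return counts["missed_edit"] / predicted_lai_total if predicted_lai_total else 0.0


# Hypothetical predictions and outcomes, invented for this example.
predicted = [True, True, False, True, False]
actual = [True, False, False, True, True]

counts = tally(predicted, actual)
print(dict(counts), round(missed_edit_rate(counts), 2))
```

Here one of the three segments predicted as LAI actually needed editing, which is the case a production workflow has to guard against.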

The authors discuss two potential strategies for implementing this prediction in production:

  • Exclusion of LAI Segments: In this approach, segments predicted as not requiring post-editing are bypassed in human review and directly published. However, the potential risk of errors in the published content needs to be considered.
  • Review of LAI Segments: Alternatively, LAI segments are marked as “100% MT matches” and translators are tasked with reviewing them, albeit at a lower rate. This allows for some human oversight while still significantly reducing post-editing time and cost.
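The two strategies above can be sketched as a simple routing rule with a rough cost comparison. Everything in this example is an assumption made for illustration: the strategy labels, the equal-length segments, and the reduced review rate of 0.3 are hypothetical values, not figures from the paper.

```python
def route(predicted_lai, strategy):
    """Decide what happens to a segment under a given strategy.

    strategy "exclude": predicted LAI segments are published without review.
    strategy "review":  predicted LAI segments get a lighter, cheaper review.
    """
    if strategy not in ("exclude", "review"):
        raise ValueError(f"unknown strategy: {strategy}")
    if not predicted_lai:
        return "post-edit"
    return "publish" if strategy == "exclude" else "light-review"


def relative_cost(predictions, strategy, review_rate=0.3):
    """Cost relative to fully post-editing every segment (which would be 1.0).

    Assumptions: all segments are the same length, and review_rate is a
    hypothetical fraction of the full post-editing rate.
    """
    if not predictions:
        return 0.0
    cost_per_route = {"post-edit": 1.0, "light-review": review_rate, "publish": 0.0}
    total = sum(cost_per_route[route(p, strategy)] for p in predictions)
    return total / len(predictions)


# Hypothetical model output: True means "predicted as leave as is".
predictions = [True, True, False, False]

print(relative_cost(predictions, "exclude"))
print(relative_cost(predictions, "review"))
```

The first strategy is cheaper in this toy setup, but every wrongly predicted LAI segment goes out unreviewed; the second keeps a human check at a price that depends on the agreed review rate.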

The research underscores the potential of AI-driven predictions to streamline the translation process, enhance efficiency, and reduce costs. However, the implementation must be carefully considered to ensure that quality is preserved.

Looking ahead, the researchers plan to expand their dataset and explore more nuanced classifications of translation quality, going beyond binary classifications. This signifies a continuous effort to refine the AI’s predictive capabilities and further integrate it into the translation pipeline.