Top Language AI Researchers Propose New Way to Auto-Evaluate Machine Translation


In March 2023, Tom Kocmi and Christian Federmann from Microsoft demonstrated that large language models (LLMs) can be prompted to assess the quality of machine translation (MT), achieving state-of-the-art (SOTA) performance in assessing system-level quality.

However, their focus was primarily on score prediction (i.e. predicting a numerical value for quality) without considering the use of any annotated data — either through in-context learning or fine-tuning.
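Score-prediction prompting of this kind amounts to asking the model directly for a number. The sketch below illustrates the idea with a paraphrased prompt; it is not Kocmi and Federmann's exact template, and the call to an actual LLM is left abstract.

```python
# A minimal sketch of zero-shot score-prediction prompting for MT quality.
# The wording is paraphrased and illustrative, not the original template;
# sending the prompt to a model is omitted.

def build_score_prompt(source: str, translation: str,
                       src_lang: str, tgt_lang: str) -> str:
    return (
        f"Score the following translation from {src_lang} to {tgt_lang} "
        f"on a continuous scale from 0 to 100, where 0 means no meaning "
        f"preserved and 100 means a perfect translation.\n\n"
        f"{src_lang} source: {source}\n"
        f"{tgt_lang} translation: {translation}\n"
        f"Score:"
    )

prompt = build_score_prompt("Der Hund schläft.", "The dog is sleeping.",
                            "German", "English")
print(prompt)
```

The model's completion (a number) is then parsed as the quality score; no annotated examples are involved in this zero-shot setup.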

More recently, Serge Gladkoff and Gleb Erofeev from Logrus Global, along with Lifeng Han and Goran Nenadic from the University of Manchester, showed that fine-tuning LLMs with post-editing data can improve quality estimation by recognizing the segments that require post-editing.

Now, a group of researchers from Google, Carnegie Mellon University, the Instituto Superior Técnico, the Instituto de Telecomunicações, Unbabel, and Inspired Cognition combined LLMs with human annotations “to design an automatic metric that generates rich feedback similar to that generated by human experts in MQM.”

In an August 14, 2023, research paper, the researchers show that the performance of LLMs in MT evaluation and quality estimation can be further improved by fine-tuning these models on “fine-grained human judgment data,” i.e., Multidimensional Quality Metrics (MQM).

These findings “might have significant implications for not only MT evaluation, but evaluation of machine-generated text in general” — Fernandes et al.

The researchers proceed to introduce AutoMQM, a prompting technique that leverages the reasoning and in-context learning capabilities of LLMs. 

AutoMQM asks the LLM to identify and categorize errors in translations based on the MQM framework. A quality score is derived automatically from the identified errors. “We don’t ask the model to produce a score, as the MQM framework provides an algorithmic procedure to obtain one from identified errors,” explained the authors. 
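The algorithmic scoring step can be sketched as follows. The severity weights used here (minor = 1, major = 5) follow a common MQM convention; the paper's exact weighting scheme may differ, so treat the numbers as illustrative.

```python
# A sketch of deriving a segment-level score from identified MQM errors.
# Weights are a commonly used MQM convention (minor = 1, major = 5) and
# are assumptions here, not necessarily the paper's exact scheme.

SEVERITY_WEIGHTS = {"minor": 1, "major": 5}

def mqm_score(errors):
    """Each error is a (span, category, severity) tuple; the score is the
    negated sum of severity weights, so 0 means no errors were found."""
    return -sum(SEVERITY_WEIGHTS[severity] for _, _, severity in errors)

# Hypothetical errors an LLM might identify in one translated segment:
errors = [
    ("the cat", "Accuracy/Mistranslation", "major"),
    ("a",       "Fluency/Grammar",         "minor"),
]
print(mqm_score(errors))  # → -6
```

Because the score is computed deterministically from the error list, the model is only ever asked to identify and categorize errors, never to emit a number itself.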

Better Performance

The authors based their experiments on PaLM and PaLM-2. They began by evaluating these models as MT metrics through score prediction prompting, comparing them against the GPT family of LLMs. PaLM-2 performed best in the zero-shot scenario (i.e., without any prior exposure to specific examples), exhibiting strong correlations with human judgments at the system level, but falling short of metrics like COMET at the segment level.

Then, the authors utilized the MQM ratings from the WMT’21 Metrics Shared Task as pools of in-context learning examples. They also found that fine-tuning LLMs on this data enhances their performance, especially for reference-less evaluation.

When assessing the relationship between model size (parameter count) and performance, they discovered that smaller models, which might not perform well in the zero-shot scenario, became competitive after fine-tuning. 

Furthermore, the researchers applied AutoMQM with PaLM-2 models, observing performance enhancements compared to score-prediction prompting alone, especially for larger models.

More specifically, they noted that the performance with AutoMQM seems to (mostly) scale with the number of in-context examples. “This suggests that LLM evaluators are much more robust to the choice of in-context examples when prompted for AutoMQM rather than for score prediction,” they explained.


The authors also evaluated the interpretability of AutoMQM. Since AutoMQM provides not only scores but also the identified error spans, they compared the predicted spans with the errors marked by annotators in the MQM annotations. The results highlighted PaLM-2 models’ ability to identify most major errors in translations.
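A comparison of predicted and annotated spans can be illustrated with a simple overlap check. The matching criterion below (counting a predicted span as correct if it intersects any gold span) is an assumption for illustration, not the paper's exact evaluation protocol.

```python
# A toy illustration of comparing predicted error spans against
# human-annotated MQM spans. The overlap-based matching criterion is an
# assumption; the paper's exact protocol is not reproduced here.

def overlaps(a, b):
    """Spans are (start, end) character offsets; True if they intersect."""
    return a[0] < b[1] and b[0] < a[1]

def span_precision_recall(predicted, gold):
    tp_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    tp_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    precision = tp_pred / len(predicted) if predicted else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    return precision, recall

pred = [(0, 7), (20, 25)]   # spans the model flagged
gold = [(2, 9)]             # spans the human annotator marked
print(span_precision_recall(pred, gold))  # → (0.5, 1.0)
```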

According to the authors, AutoMQM’s interpretability is a key advantage, as users can understand the specific errors contributing to the quality score. Graham Neubig, one of the authors, mentioned in a tweet that AutoMQM “achieves SOTA (state-of-the-art) evaluation accuracy and interpretable results for translation evaluation.”

The authors concluded that these findings “might have significant implications for not only MT evaluation, but evaluation of machine-generated text in general, and further highlight the potential of using LLMs to provide AI Feedback,” while Patrick Fernandes, one of the authors, mentioned in a tweet that these findings indicate that “fine-tuning LLMs for fine-grained, interpretable evaluation may lead to the next generation of automatic evaluators.”

Authors: Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André F. T. Martins, Graham Neubig, Ankush Garg, Jonathan H. Clark, Markus Freitag, Orhan Firat