Microsoft Says Large Language Models Are SotA Evaluators of Translation Quality


On February 16, 2023, Microsoft announced that large language models (LLMs) can achieve high machine translation quality, mainly for high-resource languages. Building on this finding, Tom Kocmi, Senior Researcher at Microsoft, and Christian Federmann, Principal Research Manager at Microsoft, investigated the applicability of LLMs for automated assessment of translation quality. “If the model can translate, it may be able to differentiate good from bad translations,” they said.

In their research paper published on February 28, 2023, they proposed GEMBA (GPT Estimation Metric Based Assessment), a GPT-based metric for assessing translation quality. Kocmi and Federmann used GEMBA to evaluate how well seven different GPT models, including ChatGPT, can assess translation quality.

According to the researchers, LLMs demonstrate “state-of-the-art capabilities” in translation quality assessment at the system level. However, they emphasized that only GPT 3.5 and larger models are capable of achieving state-of-the-art accuracy when compared to human judgments. Those findings provide “a first glimpse into the usefulness of pre-trained, generative large language models for quality assessment of translations,” they said.

“Unexpected” Performance

Kocmi and Federmann outlined the inputs required to assess translation quality using LLMs: a prompt variant, a source language name, a target language name, a source segment, a candidate translation, and a reference translation. The reference is optional and is left out in the quality estimation setting.
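As a rough illustration of how these inputs could be combined into a single prompt, here is a minimal Python sketch. The function name `build_prompt`, the 0 to 100 scale, and the prompt wording are assumptions made for this example; they are not the researchers' published templates, and the sketch covers only one scoring-style variant.

```python
from typing import Optional


def build_prompt(source_lang: str, target_lang: str, source: str,
                 candidate: str, reference: Optional[str] = None) -> str:
    """Assemble a scoring prompt from the inputs listed above.

    The wording is a hypothetical illustration, not the published template.
    """
    with_ref = reference is not None
    lines = [
        f"Score the following translation from {source_lang} to {target_lang} "
        + ("with respect to the human reference " if with_ref else "")
        + "on a continuous scale from 0 to 100.",
        f'{source_lang} source: "{source}"',
    ]
    # The reference line is only present in reference-based mode;
    # leaving it out corresponds to the quality estimation setting.
    if with_ref:
        lines.append(f'{target_lang} human reference: "{reference}"')
    lines.append(f'{target_lang} translation: "{candidate}"')
    lines.append("Score:")
    return "\n".join(lines)
```

Calling `build_prompt("English", "German", "Hello", "Hallo")` yields a reference-free prompt, while passing `reference="Hallo"` adds the reference line.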

The researchers experimented with four different prompt types, modeling two scoring tasks and two classification tasks. This was done because “scoring of translation quality may be an unnatural task for an LLM.” The scoring tasks were based on direct assessment and on scalar quality metrics, while the classification tasks involved rating translation quality using a one-to-five stars system and labeling translation quality as one of five discrete quality classes. Moreover, they evaluated these four prompt variants in two modes: with a reference translation and without a reference translation (in a quality estimation setting).

The researchers assessed the performance of GEMBA by comparing it to other top-performing automatic metrics such as COMET and BLEURT. They used data provided by the WMT22 Metrics shared task, which compares these metrics against human ratings for the English into German, English into Russian, and Chinese into English language pairs.

According to the researchers, GEMBA demonstrated “unexpected” levels of metric performance. More specifically, GEMBA outperformed all other reference-based metrics while also achieving the highest performance in the quality estimation mode. However, the results also showed that GEMBA is not yet reliable enough at the segment level and should only be applied for system-level evaluation.
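To make the distinction between segment-level and system-level evaluation concrete, the following Python sketch averages per-segment scores into one score per MT system and then checks how often the metric ranks a pair of systems the same way as human ratings. Pairwise agreement is assumed here as one plausible way to compare a metric against human judgments; the helper names and the toy numbers are invented for illustration and are not results from the paper.

```python
from itertools import combinations
from statistics import mean


def system_scores(segment_scores):
    """Average per-segment scores into one score per system."""
    return {name: mean(scores) for name, scores in segment_scores.items()}


def pairwise_accuracy(metric, human):
    """Share of system pairs that metric and humans rank in the same order."""
    pairs = list(combinations(sorted(metric), 2))
    agree = sum(
        (metric[a] - metric[b]) * (human[a] - human[b]) > 0
        for a, b in pairs
    )
    return agree / len(pairs)


# Toy data (invented): three systems with two scored segments each.
segments = {"A": [90, 80], "B": [70, 60], "C": [75, 85]}
human = {"A": 0.90, "B": 0.50, "C": 0.95}

metric = system_scores(segments)
accuracy = pairwise_accuracy(metric, human)
```

In this toy case the metric agrees with the human ranking on two of the three system pairs. Noise in individual segment scores tends to average out at the system level, which is one reason a metric can be useful for ranking systems before it is dependable for single segments.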

Progress in Document-level Evaluation

Kocmi and Federmann then evaluated the performance of seven different GPT models using GEMBA. The models ranged from GPT 2 to the latest ChatGPT model.

The researchers found that Davinci-002 and Davinci-003 (also known as GPT 3.5) and ChatGPT performed strongly in the translation assessment task across all of the prompt variants, with Davinci-003 achieving the best performance.

ChatGPT performed slightly worse than the other two models and often added an explanation of its scoring. The researchers suggest that this may be because the prompt did not explicitly tell ChatGPT to omit explanations, and that different prompts could potentially improve the model’s performance.
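When a model wraps its score in an explanation, the number has to be extracted before it can be used as a metric. The Python sketch below shows one naive way to do this, assuming a 0 to 100 scale; the function `extract_score` and its first-number-in-range heuristic are hypothetical and are not taken from the researchers' code.

```python
import re
from typing import Optional


def extract_score(reply: str, low: float = 0.0,
                  high: float = 100.0) -> Optional[float]:
    """Return the first number in the reply that falls within [low, high].

    Naive heuristic: returns None when no in-range number is found.
    """
    for match in re.finditer(r"-?\d+(?:\.\d+)?", reply):
        value = float(match.group())
        if low <= value <= high:
            return value
    return None
```

This heuristic is easy to fool: a reply that restates the scale before giving the score (for example, "On a scale of 0 to 100, I would give 85") would return 0. Constraining the output format in the prompt is likely to be more robust than parsing after the fact.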

The researchers made their code and prompt templates publicly available, along with all corresponding scoring results, “to allow for external validation and reproducibility.” They concluded that GPT-enhanced evaluation metrics could lead to progress in document-level evaluation due to their ability to use much larger context windows, which “could be beneficial as there is little research into document-level metrics.”