Huawei on Machine Translation Quality Estimation With Large Language Models

In a March 21, 2024 paper, researchers from Huawei, Northeastern University, and Nanjing University provided a “clear and concise overview” of machine translation (MT) quality estimation (QE), with a focus on large language models (LLMs) for QE applications.

The researchers conducted an “in-depth exploration of nearly all the representative methods within the QE domain,” aiming to provide what they say is “a thorough and professional understanding of the current state of QE methodologies.”

While the paper does not introduce new information, the researchers noted that it will be “extremely useful for practitioners engaged in QE research and scholars interested in entering this field.”

The researchers classified the methods that have emerged throughout the development of the QE field into three main categories: those that employ handcrafted features, those grounded in deep learning, and those leveraging LLMs. 

They explained that in the early stages of QE research, methods relied on handcrafted features to predict translation quality, leading to frameworks like QuEst and QuEst++. With the evolution of deep learning technologies, QE methods started leveraging neural networks for more sophisticated modeling. Deep learning-based QE methods can be further categorized into those based on classic deep learning approaches like deepQuest and those integrating pre-trained language models (LMs) like COMET or COMETKIWI.
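As a point of reference for what a pre-trained-LM QE metric looks like in practice, the sketch below scores source–translation pairs with the open-source `unbabel-comet` package and the publicly released CometKiwi checkpoint. The package, checkpoint name, and API calls follow that library’s documented usage rather than anything in the paper, and the checkpoint may require accepting a license on Hugging Face.

```python
# Sketch: reference-free QE with a pre-trained-LM metric (CometKiwi), assuming
# the open-source `unbabel-comet` package (pip install unbabel-comet) and
# access to the gated "Unbabel/wmt22-cometkiwi-da" checkpoint.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

# Each sample pairs a source segment with its machine translation; no
# reference translation is needed for quality estimation.
data = [
    {"src": "Der Vertrag tritt am 1. Januar in Kraft.",
     "mt": "The contract enters into force on January 1."},
    {"src": "Der Vertrag tritt am 1. Januar in Kraft.",
     "mt": "The contract steps into power on January 1."},
]

output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 -> run on CPU
print(output.scores)        # one quality score per segment
print(output.system_score)  # corpus-level average
```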

“LLM-based approaches have the potential to reach state-of-the-art (SOTA) performance levels.”

Lately, QE research has focused on methods based on LLMs. Researchers are exploring how the extensive knowledge and advanced learning capabilities of LLMs can enhance the accuracy and performance of QE models.

LLM-based Approach: Potential for SOTA Performance

The researchers have identified several applications of LLMs in QE:

  • Utilizing LLMs to directly predict translation quality scores or errors and assess their severity (see the prompt-based scoring sketch after this list).
  • Employing LLMs as foundation models and fine-tuning them with post-editing data to identify segments requiring post-editing.
  • Creating synthetic data with error annotations and explanations using LLMs, then fine-tuning LLM-based explainable QE metrics on it to provide comprehensive error diagnostic reports alongside QE scores. (Note: no human-annotated data is needed)
  • Leveraging the probabilities and the uncertainty of LLMs as quality indicators (see the log-probability sketch after this list). A higher probability (i.e., the likelihood of the generated text given the input) suggests the LLM views the generated text as coherent, whereas high uncertainty (i.e., a lack of confidence in predicting the next word or sequence of words) indicates the model’s reduced confidence in the correctness of the generated text.
  • Using LLMs to introduce errors into correct translations and create noisy sentence pairs. These pairs, along with the clean (i.e., correct) sentence pairs, can be used to train QE metrics to distinguish between accurate and inaccurate translations, assigning higher scores to the accurate ones (see the ranking-loss sketch after this list). (Note: no human-annotated data is needed)
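
To make the first item above concrete, here is a minimal sketch of direct quality-score prediction via prompting. The prompt wording, the 0–100 scale, and the `call_llm` function are illustrative assumptions rather than details from the paper; `call_llm` stands in for whatever LLM API a practitioner has available.

```python
import re
from typing import Callable

def score_translation(src: str, mt: str, call_llm: Callable[[str], str]) -> float:
    """Ask an LLM for a direct 0-100 quality score (reference-free QE).

    `call_llm` is a hypothetical placeholder: any function that sends a prompt
    string to an LLM and returns the model's text response.
    """
    prompt = (
        "Rate the quality of the following machine translation on a scale from "
        "0 (no meaning preserved) to 100 (perfect translation). "
        "Respond with the number only.\n"
        f"Source: {src}\n"
        f"Translation: {mt}\n"
        "Score:"
    )
    response = call_llm(prompt)
    match = re.search(r"\d+(\.\d+)?", response)  # pull the first number out of the reply
    if match is None:
        raise ValueError(f"No score found in LLM response: {response!r}")
    return max(0.0, min(100.0, float(match.group())))
```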
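The probability-based signal in the fourth item can be computed by force-decoding the candidate translation and averaging its token log-probabilities. The sketch below is an illustration using Hugging Face `transformers`; the `gpt2` placeholder model, the prompt format, and the split between prefix and translation tokens are assumptions, and a multilingual LLM would be used in practice.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; a multilingual LLM would be used in practice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def avg_logprob(src: str, mt: str) -> float:
    """Average log-probability of the translation tokens given the source."""
    prefix = f"Source: {src}\nTranslation:"
    prefix_len = tokenizer(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prefix + " " + mt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits            # (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]                      # gold id for each next token
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the translation tokens (approximate split at the prefix length).
    mt_lp = token_lp[:, prefix_len - 1:]
    return mt_lp.mean().item()                     # closer to 0 = more probable

print(avg_logprob("Der Vertrag tritt am 1. Januar in Kraft.",
                  "The contract enters into force on January 1."))
```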
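Finally, the last item’s idea of training on clean versus LLM-corrupted pairs can be expressed as a ranking objective: the metric should score the clean translation above the noisy one. The PyTorch snippet below is a schematic sketch; the tiny scorer, the random embeddings, and the margin value are illustrative assumptions, not the paper’s recipe.

```python
# Schematic ranking objective: the QE model should score the clean (correct)
# translation higher than the LLM-corrupted (noisy) one for the same source.
import torch
import torch.nn as nn

class TinyQEScorer(nn.Module):
    """Stand-in scorer mapping a sentence-pair embedding to a quality score."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, 256), nn.Tanh(), nn.Linear(256, 1))

    def forward(self, pair_embedding: torch.Tensor) -> torch.Tensor:
        return self.head(pair_embedding).squeeze(-1)

scorer = TinyQEScorer()
ranking_loss = nn.MarginRankingLoss(margin=0.1)

# In practice these embeddings would come from an encoder over (src, mt) pairs;
# random tensors keep the sketch self-contained.
clean_emb = torch.randn(16, 768)   # embeddings of (src, correct mt) pairs
noisy_emb = torch.randn(16, 768)   # embeddings of (src, LLM-corrupted mt) pairs

clean_scores = scorer(clean_emb)
noisy_scores = scorer(noisy_emb)
# target = 1 means "the first input should be ranked higher than the second".
loss = ranking_loss(clean_scores, noisy_scores, torch.ones_like(clean_scores))
loss.backward()
```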

While acknowledging that “the performance of LLM-based QE methods has not yet surpassed that of QE methods incorporating pre-trained LMs”, the researchers anticipate that with ongoing research and development, “LLM-based approaches have the potential to reach state-of-the-art (SOTA) performance levels.”

Regarding challenges in QE, the researchers identified LLMs’ potential to address interpretability issues and the scarcity of annotated data. LLMs can generate synthetic annotated data, which is crucial for low-resource languages, and identify specific errors and their locations in the text. “Future research should focus more on leveraging LLMs for enhancing the interpretability of QE,” they said.

However, challenges persist, including the resource-intensive nature of pre-trained LMs and LLMs and the absence of standardized evaluation metrics, which hampers the comparison and integration of models. Finally, the researchers also suggested that “future research should pay more attention to word-level QE.”

Authors: Haofei Zhao, Yilun Liu, Shimin Tao, Weibin Meng, Yimeng Chen, Xiang Geng, Chang Su, Min Zhang, and Hao Yang