Unbabel Presents a New Evaluation Metric for Chat Translation

In a March 13, 2024 paper, researchers from Instituto de Telecomunicações and Unbabel explored the challenges of evaluating machine-translated chats and introduced CONTEXT-MQM, a new LLM-based metric that leverages contextual information to improve the evaluation process. MQM stands for Multidimensional Quality Metrics.

As the researchers explained, automatic metrics have been successful in evaluating translation quality but are not widely used for assessing machine-translated chats. Unlike structured news articles, chat conversations are unstructured, short, informal, and context-dependent, which makes it challenging for existing metrics to evaluate them accurately.

To investigate how well automatic evaluation metrics capture the translation quality of conversational data, the researchers conducted a meta-evaluation of existing automatic metrics.

Specifically, they utilized MQM annotations from the WMT 2022 Chat Shared Task, which included real-life bilingual customer support conversations translated by automatic machine translation (MT) systems submitted by participants.

The translations were assessed by Unbabel’s team of expert linguists and translators, specifically trained in evaluating customer support content using the MQM framework, who considered the complete conversational context during the assessment.
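
A meta-evaluation of this kind typically measures how well each metric’s segment-level scores agree with the human MQM judgments, for example via rank correlation. The following minimal sketch, using made-up score lists rather than the actual WMT 2022 data, illustrates the idea with SciPy’s correlation functions.

```python
# Minimal sketch of a segment-level meta-evaluation: correlate automatic
# metric scores with human MQM scores. The score lists below are made-up
# placeholders, not the actual WMT 2022 Chat Shared Task data.
from scipy.stats import kendalltau, spearmanr

# Hypothetical per-segment scores (higher = better translation).
human_mqm_scores = [0.95, 0.40, 0.78, 0.10, 0.66]   # from expert annotators
metric_scores    = [0.91, 0.55, 0.80, 0.05, 0.60]   # from an automatic metric

tau, tau_p = kendalltau(metric_scores, human_mqm_scores)
rho, rho_p = spearmanr(metric_scores, human_mqm_scores)

print(f"Kendall tau: {tau:.3f} (p={tau_p:.3f})")
print(f"Spearman rho: {rho:.3f} (p={rho_p:.3f})")
```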

Room for Improvement

They discovered that reference-based metrics, such as COMET-22 and METRICX-23-XL, outperformed reference-free metrics, such as METRICX-23-QE-XL and COMET-20-QE, especially for translations in languages other than English, suggesting that there is “room for improvement for reference-free evaluation for assessing translations in languages other than English.”

By incorporating contextual information, the correlation with human judgments improved, particularly for reference-free COMET-20-QE in non-English translations. However, adding context had a negative impact on evaluating translations in English.

The researchers explored two types of contextual information for evaluating translation quality: within and across participants. A chat conversation typically involves two participants: a customer and an agent. When the text being evaluated comes from the customer, for example, it can be preceded either by context from that same participant’s previous turns (within) or by context from both participants, the customer and the agent (across).
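
To make the distinction concrete, the sketch below assembles both context variants from a toy chat history; the data structure and turn texts are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch: building "within" vs. "across" context for the turn
# being evaluated. The chat history format is an assumption for illustration.
from typing import Dict, List

chat_history: List[Dict[str, str]] = [
    {"speaker": "customer", "text": "My order hasn't arrived yet."},
    {"speaker": "agent",    "text": "I'm sorry to hear that. Can you share the order number?"},
    {"speaker": "customer", "text": "Sure, it's 12345."},
]

current_turn = {"speaker": "customer", "text": "Can you check its status, please?"}

# "Within": context only from the same participant's previous turns.
within_context = [t["text"] for t in chat_history if t["speaker"] == current_turn["speaker"]]

# "Across": context from both participants' previous turns.
across_context = [t["text"] for t in chat_history]

print("within:", within_context)
print("across:", across_context)
```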

Bilingual Context Improves Evaluation

They also investigated the use of large language models (LLMs) for assessing chat translation quality and introduced CONTEXT-MQM, an LLM-based metric that utilizes context to enhance evaluation. Initial experiments showed promising improvements in the quality assessment of machine-translated chats.

“Our preliminary experiments with CONTEXT-MQM show that adding bilingual context to the evaluation prompt indeed helps improve the quality assessment of machine-translated chats,” they said.
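
The paper’s exact prompt is not reproduced here, so the following is only a rough sketch of how bilingual conversational context might be prepended to an MQM-style evaluation prompt for an LLM; the template wording, field names, and helper function are assumptions for illustration.

```python
# Rough, hypothetical sketch of an MQM-style evaluation prompt that includes
# bilingual conversational context. The template is illustrative only and is
# not the prompt used in the CONTEXT-MQM paper.
PROMPT_TEMPLATE = """You are an expert translator evaluating a customer support chat.

Conversation context (source / target):
{context}

Now assess the translation of the current turn.
Source ({src_lang}): {source}
Translation ({tgt_lang}): {translation}

List the translation errors using MQM categories (accuracy, fluency, style, terminology),
label each as minor, major, or critical, and then output an overall MQM score."""


def build_prompt(context_pairs, source, translation, src_lang="en", tgt_lang="de"):
    """Format the bilingual context and the current segment into one prompt."""
    context = "\n".join(f"- {src} / {tgt}" for src, tgt in context_pairs)
    return PROMPT_TEMPLATE.format(
        context=context, source=source, translation=translation,
        src_lang=src_lang, tgt_lang=tgt_lang,
    )


# Example usage with made-up chat turns.
print(build_prompt(
    context_pairs=[("My order hasn't arrived yet.",
                    "Meine Bestellung ist noch nicht angekommen.")],
    source="Can you check its status, please?",
    translation="Können Sie bitte den Status prüfen?",
))
```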

The researchers highlighted the potential of using LLMs with contextual information to evaluate the quality of chat translations. They added that future work should explore alternative prompting strategies for including context across various language pairs and LLMs.

Authors: Sweta Agrawal, Amin Farajian, Patrick Fernandes, Ricardo Rei, André F.T. Martins