What’s the Difference Between Machine Translation Quality Evaluation and Estimation?

Evaluating the quality of machine translation (MT) output is difficult, and assessing MT quality at scale is a major challenge for companies looking to translate large quantities of text with short turnaround times.

Language service providers and academia have long relied on in-depth analysis of machine translation engines against “gold standard” reference segments. Machine translation quality has been measured using a reference-based metric such as BLEU or WER.

This is known as “machine translation quality evaluation” and, despite its shortcomings, it continues to be the modus operandi for testing MT systems in research.
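
To make the reference-based approach concrete, here is a minimal sketch using the open-source sacrebleu Python library; the hypothesis and reference sentences are invented for illustration.

```python
# pip install sacrebleu
import sacrebleu

# MT output to be scored (one entry per segment).
hypotheses = ["The cat sat on the mat."]

# One stream of human "gold standard" references, aligned with the hypotheses.
references = [["The cat is sitting on the mat."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")  # corpus-level score on a 0-100 scale
```

The key point is that a human reference translation must exist before any score can be produced, which is exactly what quality estimation dispenses with.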

Human-first machine translation evaluation co-exists with a machine-first, hands-off approach to assessing machine translation output, known as “machine translation quality estimation” (MTQE).

MTQE differs from machine translation quality evaluation in that a model assesses the quality of MT segments automatically, without relying on manual scoring.

At a practical level, MTQE is already integrated into some translation management systems (TMS) to assess the quality of a translation as part of the translation workflow. A TMS first applies a translation memory to a given text, then applies machine translation, and then MTQE kicks in, giving each MT segment a score from 0 to 100.

This drives innovation and efficiency at the beginning of the translation process, as the system decides which segments require an expert in the loop.
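
That routing step can be pictured as a simple thresholding rule over the 0–100 scores. The sketch below is a hypothetical illustration; the 90-point cut-off, the data structure, and the function names are assumptions rather than the behaviour of any particular TMS.

```python
from dataclasses import dataclass

# Hypothetical segment record: the MTQE score is assumed to use the
# 0-100 scale described above.
@dataclass
class Segment:
    source: str
    mt_output: str
    qe_score: float  # 0 (unusable) .. 100 (publishable as-is)

# Assumed cut-off; real workflows tune this per language pair and content type.
REVIEW_THRESHOLD = 90.0

def route(segments: list[Segment]) -> tuple[list[Segment], list[Segment]]:
    """Split segments into those that can pass straight through and
    those that need an expert in the loop."""
    auto_approve = [s for s in segments if s.qe_score >= REVIEW_THRESHOLD]
    needs_review = [s for s in segments if s.qe_score < REVIEW_THRESHOLD]
    return auto_approve, needs_review

if __name__ == "__main__":
    batch = [
        Segment("Hello, world.", "Hallo, Welt.", 96.5),
        Segment("Release the handbrake.", "Lösen Sie die Handbremse nicht.", 41.0),
    ]
    ok, review = route(batch)
    print(f"{len(ok)} segment(s) auto-approved, {len(review)} sent to a linguist")
```

In practice, some workflows use several score bands (publish as-is, light post-editing, full post-editing) rather than a single cut-off, but the principle is the same.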

Trust in the Machine

On the topic of MTQE at SlatorCon, Conchita Laguardia, Senior Technical Program Manager at Citrix, said, “the traditional way of looking at quality management is always very language-based. What you’re asking [the industry] now is actually to trust a machine to tell where the MT has gone wrong.”

Language service providers and localization buyers can access this technology through tools such as KantanQES from KantanAI, through a TMS such as Smartling or Phrase, or via a connector to MTQE specialist providers such as TAUS or ModelFront.

More recently, Translated’s ModernMT has rolled out MTQE functionality, as has RWS with RWS Evolve.

With the growing use of large language models (LLMs) in the translation industry, Unbabel has also released the first open-source LLM specifically fine-tuned to predict translation quality, building on its predecessor, OpenKiwi, and opening up the possibility of further development in MTQE models.
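
As a rough illustration of how an open-source quality estimation model can be called from code, here is a minimal sketch using Unbabel’s comet package with its publicly released CometKiwi checkpoint. The model name, example sentences, and score interpretation are assumptions made for illustration; the article does not prescribe this exact model or API, and the checkpoint is gated on Hugging Face, so accepting its licence may be required first.

```python
# pip install unbabel-comet
# Sketch only: assumes the Unbabel/wmt22-cometkiwi-da checkpoint stands in
# for the kind of QE model discussed above.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

# Quality estimation is reference-free: only source and MT output are needed.
data = [
    {"src": "The software update is installed automatically.",
     "mt": "Die Softwareaktualisierung wird automatisch installiert."},
    {"src": "Do not unplug the device during the update.",
     "mt": "Ziehen Sie das Gerät während des Updates."},  # deliberately flawed MT
]

output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # one quality score per segment (roughly 0-1)
print(output.system_score)  # average over the batch
```

Note that, unlike BLEU, no reference translation appears anywhere in the input: the model predicts quality from the source and the MT output alone.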