How to Fix the 5 Flaws in Evaluating Machine Translation

Few would dispute that machine translation quality has improved significantly since Google launched neural machine translation into production three and a half years ago, (infamously) describing some of the system’s output as “nearly indistinguishable from human translation.” Experts in the field responded with differing views.

Ever since Google’s 2016 claim, many rival machine translation providers, big and small, have proclaimed similar breakthroughs. Now a new study examines the basis of such claims that (as the researchers put it) “machine translation has increased […] to the degree that it was found to be indistinguishable from professional human translation in a number of empirical investigations.” And it does so by taking a closer look at the human assessments that led to such claims.

The new study, published in the peer-reviewed Journal of Artificial Intelligence Research, shows that recent findings of human parity in machine translation were due to “weaknesses” in the way humans evaluated MT output — that is, MT evaluation protocols that are currently regarded as best practices.

If this is true, then the industry needs to stop dead in its tracks and, as the researchers suggest, “revisit” these so-called best practices around evaluating MT quality.

Human evaluation of MT quality depends on these three factors…

The study is called “A Set of Recommendations for Assessing Human–Machine Parity in Language Translation.” Published in March 2020, it was authored by the following:

  • Samuel Läubli, Institute of Computational Linguistics, University of Zurich
  • Sheila Castilho, ADAPT Centre, Dublin City University
  • Graham Neubig, Language Technologies Institute, Carnegie Mellon University
  • Rico Sennrich, Institute of Computational Linguistics, University of Zurich
  • Qinlan Shen, Language Technologies Institute, Carnegie Mellon University
  • Antonio Toral, Center for Language and Cognition, University of Groningen

“Machine translation (MT) has made astounding progress in recent years thanks to improvements in neural modelling,” the researchers write, “and the resulting increase in translation quality is creating new challenges for MT evaluation. Human evaluation remains the gold standard, but there are many design decisions that potentially affect the validity of such a human evaluation.”

Läubli et al. examined human evaluation studies in which neural machine translation (NMT) systems had performed at or above the level of human translators. One such study, published in 2018 and previously covered by Slator, concluded that NMT had reached human parity because (using current human evaluation best practices) no significant difference between human and machine translation outputs was found.

But in a blind qualitative analysis outlined in this new study, Läubli et al. showed that the earlier study’s MT output “contained significantly more incorrect words, omissions, mistranslated names, and word order errors” than the output of professional human translators.

Moreover, the study showed that human evaluation of MT quality depends on three factors: “the choice of raters, the availability of linguistic context, and the creation of reference translations.”

Choice of Raters

In rating MT output, “professional translators showed a significant preference for human translation, while non-expert raters did not,” the researchers said, pointing out that human assessments typically rely on crowdsourced workers to minimize cost.

Professional translators would, therefore, “provide more nuanced ratings than non-experts” (i.e., amateur evaluators with undefined or self-rated proficiency), thus showing a wider gap between MT output and human translation.
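
To make the notion of a “significant preference” concrete, here is a minimal sketch of one common way to test pairwise preferences: a two-sided exact sign test. The article does not say which statistical test the studies used, so the choice of test, the function name `sign_test_p_value`, and the rating counts below are all assumptions made for illustration.

```python
from math import comb


def sign_test_p_value(prefers_human: int, prefers_mt: int) -> float:
    """Two-sided exact sign test on pairwise preferences.

    Assumes ties (no preference) were dropped before counting.
    """
    n = prefers_human + prefers_mt
    if n == 0:
        return 1.0
    k = min(prefers_human, prefers_mt)
    # Probability of a split at least this lopsided if raters had no
    # real preference (each choice is a fair coin flip).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up counts for illustration only; these are not figures from the study.
expert_p = sign_test_p_value(prefers_human=70, prefers_mt=30)
crowd_p = sign_test_p_value(prefers_human=52, prefers_mt=48)

print(f"professional raters: p = {expert_p:.4f}")
print(f"non-expert raters:   p = {crowd_p:.4f}")
```

With these hypothetical counts, the professional split falls below the conventional 0.05 threshold while the non-expert split does not. That mirrors the pattern the researchers describe, without reproducing their data.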

Linguistic Context

Linguistic context was also crucial, the study showed, because evaluators “found human translation significantly more accurate than machine translation when evaluating full documents, but not when evaluating single sentences out of context.”

While both machine translation and evaluation have, historically, operated at sentence-level, the study said, “human raters do not necessarily understand the intended meaning of a sentence shown out-of-context […] which limits their ability to spot some mistranslations. Also, a sentence-level evaluation will be blind to errors related to textual cohesion and coherence.”
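
The difference between the two setups can be sketched in a few lines of Python. This is only an illustration of how rating items might be assembled; the function names, the data layout, and the example sentences are hypothetical and do not come from the study.

```python
import random
from typing import Dict, List, Tuple


def sentence_level_items(
    docs: Dict[str, List[str]], seed: int = 0
) -> List[Tuple[str, int, str]]:
    """Flatten documents into single sentences and shuffle them.

    Each rater sees a sentence without its neighbours.
    """
    items = [
        (doc_id, idx, sentence)
        for doc_id, sentences in docs.items()
        for idx, sentence in enumerate(sentences)
    ]
    random.Random(seed).shuffle(items)
    return items


def document_level_items(docs: Dict[str, List[str]]) -> List[Tuple[str, List[str]]]:
    """Keep each document whole, with sentences in their original order."""
    return [(doc_id, list(sentences)) for doc_id, sentences in docs.items()]


# Hypothetical translations; the second sentence of each document relies
# on the first one to resolve its pronoun.
docs = {
    "doc-1": ["The committee rejected the proposal.", "It was too expensive."],
    "doc-2": ["Maria called her brother.", "He did not pick up."],
}

for doc_id, idx, sentence in sentence_level_items(docs):
    print(f"[sentence item] {doc_id}#{idx}: {sentence}")

for doc_id, sentences in document_level_items(docs):
    print(f"[document item] {doc_id}: {' '.join(sentences)}")
```

In the shuffled sentence-level items, a rater judging “It was too expensive.” cannot tell what “It” refers to, so a mistranslated pronoun would go unnoticed. The document-level items keep that information available.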

Creation of Reference Translations

As for the third factor, the creation of reference translations, the researchers noted that the aforementioned 2018 study used inconsistent source texts: only half of them had originally been written in the source language, while the other half had been translated from the target language into the source language.

“Since translated texts are usually simpler than their original counterparts […] they should be easier to translate for MT systems. Moreover, different human translations of the same source text sometimes show considerable differences in quality, and a comparison with an MT system only makes sense if the human reference translations are of high quality,” they said.

Crucially, the new study also found that “aggressive editing of human reference translations for target language fluency can decrease adequacy to the point that they become indistinguishable from machine translation, and that raters found human translations significantly better than machine translations of original source texts, but not of source texts that were translations themselves.”

What the researchers recommend…

The study concludes that “machine translation quality has not yet reached the level of professional human translation, and that human evaluation methods which are currently considered best practice fail to reveal errors in the output of strong NMT systems.” Those who use machine translation should therefore consider making the following design changes to their MT evaluation process:

  • Appoint professional translators as raters
  • Evaluate documents, not sentences
  • Evaluate fluency on top of adequacy
  • Do not heavily edit reference translations for fluency
  • Use original source texts

The researchers end by saying that while their recommendations are intended to increase the validity of MT assessments, they are aware that having professional translators perform MT evaluations is expensive. They, therefore, welcome further studies into “alternative evaluation protocols that can demonstrate their validity at a lower cost.”