Automatic Metrics Can Outperform Crowd Workers in Machine Translation Evaluation

Google Machine Translation Evaluation

New research from Google shows that professional translators — and even automated systems trained on their judgments — rank machine translation (MT) systems very differently than inexperienced crowd workers do.

As detailed in the April 29, 2021 paper, “Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation,” professional translators show a clear preference for human translation over MT.

Co-authors Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey, all researchers at Google, set out to develop a “platinum standard” of error identification and ranking that can be tailored to the needs of end-users.

For all of its advancements, there is still no universally agreed-upon standard procedure for evaluating MT output. The wide range of possibly correct answers (i.e., translations) makes automating evaluation difficult, although tech-forward companies such as Lilt are already exploring automated MT review.

Human evaluation, a favorite topic of MT-focused academic research, is not necessarily a silver bullet. Even professional translators can disagree on whether fluency trumps accuracy in a given sentence. 

But past research has shown that other issues arise when crowd workers without a background in translation are hired as a cost-saving measure. Crowd workers are less able to distinguish human translation from MT, especially as MT quality improves, and prefer more literal, “easy-to-rate” translations.

The researchers gave professional translators, all native speakers of the relevant target languages, full-document context for MT output from the top systems of the WMT 2020 shared task for English–German and Chinese–English translation, along with human reference translations for each language pair.

The translators based their rankings on the Multidimensional Quality Metrics (MQM) framework, a customizable hierarchy of translation errors. MQM’s fine-grained error categories address accuracy, fluency, terminology, and style.
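In an MQM-style evaluation, raters mark individual error spans with a category and a severity, and each segment receives a weighted penalty score. The sketch below illustrates the general idea only; the category names and severity weights are illustrative assumptions, not the exact values used in the Google study, since MQM lets evaluators tailor both to their needs.

```python
# Minimal sketch of MQM-style weighted error scoring (lower is better).
# Severity weights here are illustrative assumptions; MQM implementations
# customize them (e.g., heavier penalties for major accuracy errors).
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}

def mqm_score(errors, num_segments):
    """Average weighted error penalty per segment.

    errors: list of (category, severity) tuples annotated by a rater.
    num_segments: number of segments the annotations cover.
    """
    total = sum(SEVERITY_WEIGHTS[severity] for _category, severity in errors)
    return total / num_segments

# Hypothetical annotations over a 3-segment document:
annotations = [
    ("accuracy/mistranslation", "major"),
    ("fluency/grammar", "minor"),
    ("style/awkward", "minor"),
]
print(mqm_score(annotations, 3))  # 7.0 total penalty over 3 segments
```

Because scores aggregate fine-grained, categorized errors rather than a single holistic judgment, they make disagreements between raters easier to diagnose, which is part of what motivates using MQM over simple adequacy or fluency scales.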

According to the authors, this is the largest MQM study to date, with professional translators evaluating 1,418 segments for English–German and 2,000 segments for Chinese–English.

Compared to the original rankings for the WMT 2020 shared task, provided by crowd workers, the MQM ratings by professional translators bumped some low-ranked MT systems to much higher positions.

“Unlike ratings acquired by crowd-workers […], MQM labels acquired with professional translators show a large gap between the quality of human and machine-generated translations,” the authors wrote, noting that automatic metrics trained on MQM and informed by professional translators “already outperform crowd-worker human evaluation.”