How to Fix the 5 Flaws in Evaluating Machine Translation

Machine Translation · by Marion Marking · March 26, 2020

Few would dispute that machine translation quality has improved significantly over the past three and a half years. It was back then that Google launched neural machine translation into production, (infamously) describing some of the system’s output as “nearly indistinguishable from human translation.” Experts in the field responded with a range of views.

Ever since Google’s 2016 claim, many rival machine translation providers, big and small, have proclaimed similar breakthroughs. Now a new study examines the basis of such claims that (as the researchers put it) “machine translation has increased […] to the degree that it was found to be indistinguishable from professional human translation in a number of empirical investigations.” And it does so by taking a closer look at the human assessments that led to such claims.

The new study, published in the peer-reviewed Journal of Artificial Intelligence Research, shows that recent findings of human parity in machine translation were due to “weaknesses” in the way humans evaluated MT output — that is, MT evaluation protocols that are currently regarded as best practices.

If this is true, then the industry needs to stop dead in its tracks and, as the researchers suggest, “revisit” these so-called best practices around evaluating MT quality.

Human evaluation of MT quality depends on these three factors…

The study is called “A Set of Recommendations for Assessing Human–Machine Parity in Language Translation.” Published in March 2020, it was authored by Samuel Läubli (Institute of Computational Linguistics, University of Zurich); Sheila Castilho (ADAPT Centre, Dublin City University); Graham Neubig (Language Technologies Institute, Carnegie Mellon University); Rico Sennrich (Institute of Computational Linguistics, University of Zurich); Qinlan Shen (Language Technologies Institute, Carnegie Mellon University); and Antonio Toral (Center for Language and Cognition, University of Groningen).

“Machine translation (MT) has made astounding progress in recent years thanks to improvements in neural modelling,” the researchers write, “and the resulting increase in translation quality is creating new challenges for MT evaluation. Human evaluation remains the gold standard, but there are many design decisions that potentially affect the validity of such a human evaluation.”

Läubli et al. examined human evaluation studies in which neural machine translation (NMT) systems had performed at or above the level of human translators, such as a 2018 study, previously covered by Slator, which concluded that NMT had reached human parity because (using current human evaluation best practices) no significant difference between human and machine translation output was found.

But in a blind qualitative analysis outlined in the new study, Läubli et al. showed that the earlier study’s MT output “contained significantly more incorrect words, omissions, mistranslated names, and word order errors” than the output of professional human translators.

Moreover, the study showed that human evaluation of MT quality depends on three factors: “the choice of raters, the availability of linguistic context, and the creation of reference translations.”

Choice of Raters

In rating MT output, “professional translators showed a significant preference for human translation, while non-expert raters did not,” the researchers said, pointing out that human assessments typically rely on crowdsourced workers to minimize cost.

Professional translators would therefore “provide more nuanced ratings than non-experts” (i.e., amateur evaluators with undefined or self-rated proficiency), thus showing a wider gap between MT output and human translation.

Linguistic Context

Linguistic context was also crucial, the study showed, because evaluators “found human translation significantly more accurate than machine translation when evaluating full documents, but not when evaluating single sentences out of context.”

While both machine translation and evaluation have, historically, operated at sentence-level, the study said, “human raters do not necessarily understand the intended meaning of a sentence shown out-of-context […] which limits their ability to spot some mistranslations. Also, a sentence-level evaluation will be blind to errors related to textual cohesion and coherence.”
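
To make the contrast concrete, here is a minimal sketch, in Python, of what a document-level rating unit might look like next to a sentence-level one. The class and field names are illustrative assumptions, not taken from the study's materials:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DocumentEvalItem:
    """One rating unit in a document-level protocol: the rater sees the
    full source document and a full candidate translation, so omissions,
    mistranslations, and cohesion errors across sentences stay visible."""
    source_sentences: List[str]      # the whole source document, in order
    candidate_sentences: List[str]   # one system's (or translator's) output

@dataclass
class SentenceEvalItem:
    """One rating unit in a sentence-level protocol: a single pair shown
    out of context, which can hide exactly the errors described above."""
    source: str
    candidate: str
```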

Creation of Reference Translations

As for the third factor, the construction of reference translations, the researchers noted that the aforementioned 2018 study used an inconsistent set of source texts as reference: only half were originally written in the source language, while the other half had been translated from the target language into the source language.

“Since translated texts are usually simpler than their original counterparts […] they should be easier to translate for MT systems. Moreover, different human translations of the same source text sometimes show considerable differences in quality, and a comparison with an MT system only makes sense if the human reference translations are of high quality,” they said.
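
One practical way to act on this, assuming each test segment carries metadata recording the language it was originally authored in (WMT test sets, for example, include such a field), is to evaluate only on segments whose original language matches the source language. A rough sketch, with hypothetical field names:

```python
# Each dict stands for one test segment; 'orig_lang' records the language
# the text was originally authored in. Field names are illustrative.
test_items = [
    {"src_lang": "zh", "orig_lang": "zh", "text": "..."},  # genuine source text
    {"src_lang": "zh", "orig_lang": "en", "text": "..."},  # "translationese" source
]

# Keep only segments originally written in the source language.
original_only = [it for it in test_items if it["orig_lang"] == it["src_lang"]]
print(f"kept {len(original_only)} of {len(test_items)} segments")
```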

Crucially, the new study also found that “aggressive editing of human reference translations for target language fluency can decrease adequacy to the point that they become indistinguishable from machine translation, and that raters found human translations significantly better than machine translations of original source texts, but not of source texts that were translations themselves.”

What the researchers recommend…

Since, as the study concludes, “machine translation quality has not yet reached the level of professional human translation, and that human evaluation methods which are currently considered best practice fail to reveal errors in the output of strong NMT systems,” it behooves those who use machine translation to consider making the following design changes to their MT evaluation process (see the sketch after this list):

  • Appoint professional translators as raters
  • Evaluate documents, not sentences
  • Evaluate fluency on top of adequacy
  • Do not heavily edit reference translations for fluency
  • Use original source texts
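
As a minimal sketch of how the outcome of such a redesigned evaluation might be checked, the Python snippet below runs an exact two-sided sign test on invented pairwise judgments in which professional raters compare full human-translated and machine-translated documents. The numbers are hypothetical, and the study itself may report different statistics:

```python
from math import comb

def sign_test_p(prefer_a: int, prefer_b: int) -> float:
    """Exact two-sided sign test: the probability of a preference split at
    least this lopsided if raters actually had no preference (p = 0.5).
    Tied (no-preference) judgments should be dropped beforehand."""
    n = prefer_a + prefer_b
    k = max(prefer_a, prefer_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical document-level judgments by professional translators:
# 34 of 50 non-tied comparisons preferred the human translation.
p = sign_test_p(prefer_a=34, prefer_b=16)
print(f"two-sided sign test p = {p:.4f}")  # ~0.016: a significant preference
```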

The researchers end by saying that while their recommendations are intended to increase the validity of MT assessments, they are aware that having professional translators perform MT evaluations is expensive. They therefore welcome further studies into “alternative evaluation protocols that can demonstrate their validity at a lower cost.”
