Measuring Machine Translation Quality in the Era of Neural MT

As neural machine translation (NMT) becomes the new standard, quantifying the new technology's quality gains is proving increasingly challenging.

In a recent paper, Professor Andy Way, Deputy Director of the ADAPT Centre for Digital Content Technology, unpacked quality expectations for machine translation (MT).

Rather than presenting heavily technical research, Way discusses quality evaluation for MT, and how and why this is an important issue to address as NMT continues to develop into a major industry-changer.

“Companies often overlook how disruptive a technology MT actually is: it impacts not just technically trained staff, but also project managers, sales and marketing, the training team, finance employees, and of course post-editors and quality reviewers,” Way said in his paper. “All of this should be taken on board beforehand if the correct decision is to be taken with full knowledge of the expected return on investment, but in practice it rarely is.”

For NMT, one major concern is the continued reliance on Bilingual Evaluation Understudy (BLEU), the long-standing automatic evaluation metric used in the majority of MT research.

BLEU’s Limitations

BLEU became the de facto automatic evaluation metric through sheer prevalence: the easiest way to demonstrate gains in MT research is to use the same scoring method as previous work.

When it comes to NMT, however, the improvements over predecessor MT—not to mention the differences in design (i.e. NMT usually runs on character-level encoder-decoder systems)—make BLEU even less suited to quantifying output quality.

Aside from the issue of BLEU comparing MT output to a single reference human translation, Way illustrates the limitations of BLEU more concretely through a sample reference translation and sample MT outputs.

The reference translation is: “The President frequently makes his vacation in Crawford Texas.”

The MT outputs are:

  1. George Bush often takes a holiday in Crawford Texas
  2. holiday often Bush a takes George in Crawford Texas
  3. George rhododendron often takes a holiday in Crawford Texas

Way notes that all three outputs would receive the same BLEU score, due to inherent limitations in how BLEU calculates scores.
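Way's example is easy to reproduce with a stripped-down, sentence-level BLEU in pure Python. This is a sketch for illustration only: real BLEU is computed over a whole test set with 4-gram precision, whereas trigrams are used here so a single short sentence does not zero out the score.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=3):
    """Stripped-down sentence-level BLEU: geometric mean of clipped
    n-gram precisions times a brevity penalty. Real BLEU uses
    4-grams over a corpus; max_n=3 avoids zero counts on one
    short sentence."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        clipped = sum(min(count, ref[g]) for g, count in cand.items())
        precisions.append(clipped / max(1, sum(cand.values())))
    if min(precisions) == 0:
        return 0.0
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "The President frequently makes his vacation in Crawford Texas".split()
outputs = [
    "George Bush often takes a holiday in Crawford Texas",
    "holiday often Bush a takes George in Crawford Texas",
    "George rhododendron often takes a holiday in Crawford Texas",
]
# Each output matches the reference on exactly the same n-grams
# ("in Crawford Texas", etc.), so all three earn an identical
# score despite the obvious differences in quality.
scores = [round(bleu(o.split(), reference), 4) for o in outputs]
print(scores)
```

Because BLEU only counts overlapping n-grams against the reference, the scrambled and nonsensical outputs tie with the fluent one.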

He proposes that the best way to assess MT output is to consider two factors:

  1. Fitness for purpose of the translation
  2. Perishability of the content

In his own words: “how will the translation be used, and for how long will we need to consult that translation?”

Demand for NMT Quality Metrics

Way went on to explain in his paper that “n-gram-based metrics such as BLEU are insufficient to truly demonstrate the benefits of NMT over [phrase-based, statistical, and hybrid] MT.”

He explained that existing research on NMT's gains over predecessor tech shows significant improvements in various areas, and yet overall BLEU gains typically amount to only around two points.

Additionally, on human-machine interaction, Way says MT and translation memory (TM) fuzzy matching is already a common tool in a human translator’s arsenal, so much so that it “compels MT developers to begin to output translations from their MT systems with an accompanying estimation of quality that makes sense to translators.”

In that regard, “while BLEU score is undoubtedly of use to MT developers, outputting a target sentence with a BLEU score of (say) 0.435 is pretty meaningless to a translator.”

Furthermore, this affects pricing and pay. “Translators are used to being paid different rates depending on the level of fuzzy match suggested by the TM system for each input string,” Way writes in his paper.
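The contrast between the two kinds of score is easy to illustrate. A TM fuzzy match is a percentage similarity between a new segment and a stored one; the toy sketch below uses Python's difflib as a stand-in. Commercial TM tools compute matches with their own (typically word-based, edit-distance) formulas, so the exact figure here is illustrative only.

```python
import difflib

def fuzzy_match(segment, tm_source):
    """Toy fuzzy-match score: character-level similarity ratio,
    expressed as the percentage figure translators are used to.
    Real TM systems use their own proprietary formulas; this is
    only an illustration of the concept."""
    ratio = difflib.SequenceMatcher(None, segment.lower(), tm_source.lower()).ratio()
    return round(ratio * 100)

# One word differs between the new segment and the TM entry.
new_segment = "The President frequently takes his vacation in Crawford Texas"
tm_segment  = "The President frequently makes his vacation in Crawford Texas"
print(f"{fuzzy_match(new_segment, tm_segment)}% match")
```

A figure like "98% match" maps directly onto familiar payment bands, whereas a BLEU score of 0.435 tells a translator nothing about how much editing a segment will need.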

Finding Ways to Quantify Quality in an NMT-Driven Industry

Way notes that since many NMT engines are character-level systems, "evaluation metrics such as ChrF [proposed by Maja Popović in 2015] which operate at the character level" become more appropriate.

Slator reached out to Popović, Researcher at DFKI – Language Technology Lab, Berlin, as a subject matter expert for our NMT 2018 report. Asked about BLEU, she said, “BLEU reached its limit for any translation, not only NMT.”

Popović places a vote of confidence in character-based scores “such as BEER, chrF and characTER… for their potential for MT evaluation.”
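The core idea behind chrF can be sketched in a few lines: an F-score over character n-grams, with recall weighted more heavily than precision (beta = 2 by default). Because it matches at the character level, it gives partial credit for near-miss word forms that BLEU's exact word matching misses. The official implementation (such as the one in sacreBLEU) differs in details like whitespace handling, so this is an approximation of the idea only.

```python
from collections import Counter

def chrf(candidate, reference, max_n=6, beta=2.0):
    """Minimal sketch of the chrF idea (Popović, 2015): the average,
    over character n-gram orders 1..max_n, of an F-score in which
    recall counts beta times as much as precision. Reference
    implementations handle whitespace and smoothing differently."""
    def grams(text, n):
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))
    f_scores = []
    for n in range(1, max_n + 1):
        c, r = grams(candidate, n), grams(reference, n)
        overlap = sum(min(v, r[g]) for g, v in c.items())
        prec = overlap / max(1, sum(c.values()))
        rec = overlap / max(1, sum(r.values()))
        if prec + rec == 0:
            f_scores.append(0.0)
            continue
        f_scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return sum(f_scores) / max_n

ref = "The President frequently makes his vacation in Crawford Texas"
fluent = "George Bush often takes a holiday in Crawford Texas"
scrambled = "holiday often Bush a takes George in Crawford Texas"
print(round(chrf(fluent, ref), 3), round(chrf(scrambled, ref), 3))
```

Because character n-grams also span word boundaries, reordering the same words changes the score, unlike in Way's BLEU example above.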

She also told Slator she is looking forward to the incorporation of linguistic information into NMT systems, “because I believe linguistic knowledge is important, one way or another.”

Other experts in the field provided their outlook on quality assessment for NMT, including Yannis Evangelou, Founder and CEO of linguistic QA company LexiQA, who illustrated a process for NMT split into three stages: pre-translation, machine translation, and post-editing.

Other respondents in the Slator report such as Jean Senellart, CTO of Systran, Mihail Vlad, VP Machine Learning Solutions of SDL, and even NMT research pioneer Kyunghyun Cho of New York University agreed with Way's point in his paper that MT output quality should be measured within the context of the scenario in which it is used.

Vlad offered some examples:

  1. Quality for post editing is measured by the translator being more productive.
  2. Quality for multilingual eDiscovery is measured by the accuracy of identifying the right documents.
  3. Quality for multilingual text analytics is measured by the effectiveness of the analyst in identifying the relevant information.
  4. Quality for multilingual chat is measured by the feedback rating of the end customer.

Researcher Pavel Levin believes that, in the near future, NMT quality assessment may remain as fragmented as the demand for it: “We will be seeing practitioners rolling out their own metrics which are more relevant to their problems (e.g. metrics related to handling of particular named entities, scores from custom QA systems, potentially machine learning based, etc.) and use several of them in combinations.”

In his paper, Way writes that “If NMT does become the new state-of-the-art as the field expects, one can anticipate that further new evaluation metrics tuned more precisely to this paradigm will appear sooner rather than later.”

Download the Slator 2019 Neural Machine Translation Report for the latest insights on the state of the art in neural machine translation and its deployment.