When Google CEO Sundar Pichai presented Google Translate’s progress after going neural, he boasted that translation accuracy improved from 3.694 (phrase-based) to 4.263 (neural). Pichai was quoted as saying that “human quality is only a step away at 4.636.”
Measuring translation quality down to the third decimal place? What metric did Pichai use? And how is it supposed to measure translation accuracy—whatever that is—with a level of precision more appropriate to the construction of an airplane engine?
“Human evaluations of machine translation are extensive but expensive…can take months to finish, and involve human labor that cannot be reused.” So began the abstract of a 2002 paper by a group of IBM Watson Research Center scientists.
With this problem in mind, scientists Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu presented a new method for the automatic evaluation of machine translation (MT) that would be faster, “inexpensive, and language-independent.” They called it the Bilingual Evaluation Understudy, or simply BLEU.
The BLEU study was, in part, developed under contract with the US Defense Advanced Research Projects Agency (DARPA), which has funded a number of automated translation efforts.
BLEU’s inventors intended it to serve as an “automated understudy” and substitute for “skilled human judges” whenever frequent, speedy evaluations are needed. Fourteen years after the paper was published, BLEU has become the de facto standard for evaluating machine translation output.
Difference Between Human and Machine
The idea behind BLEU is that the closer a machine translation is to a professional human translation, the better it is. The BLEU score essentially measures the difference between human and machine translation output, explained Will Lewis, Principal Technical Project Manager of the Microsoft Translator team.
In a 2013 interview with colleague Chris Wendt, Lewis said, “[BLEU] looks at the presence or absence of particular words, as well as the ordering and the degree of distortion—how much they actually are separated in the output.”
BLEU’s evaluation system requires two ingredients: (i) a numerical “translation closeness” metric, which is (ii) computed against a corpus of human reference translations.
BLEU derives that metric from n-grams—contiguous sequences of words, a device widely used in computational linguistics—by counting how many of the machine output’s n-grams also appear in the references and combining the resulting precision scores.
The result is typically expressed on a scale from 0 to 1, with 1 representing a hypothetically “perfect” translation. Since the human reference against which MT is measured is typically made up of multiple translations, however, even a human translation would not score a 1. The score is sometimes multiplied by 100 or, as in the case of Google mentioned above, by 10.
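The mechanics can be sketched in a few lines of Python. This is an illustrative toy, not the reference implementation: it follows the shape of Papineni et al.’s metric—clipped (“modified”) n-gram precision up to 4-grams, combined by a geometric mean and scaled by a brevity penalty—but omits the smoothing and tokenization details real scoring tools apply.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-word sequences in a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Toy BLEU: geometric mean of modified n-gram precisions
    times a brevity penalty, in the spirit of Papineni et al. (2002)."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # "Modified" precision: each candidate n-gram is credited at most
        # as many times as it appears in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if clipped == 0:
            return 0.0  # no overlap at some n-gram order -> score of 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: punish candidates shorter than the closest reference.
    ref_len = min((len(r) for r in refs), key=lambda l: abs(l - len(cand)))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An exact match against a reference scores 1.0; a candidate that shares no 4-gram with any reference scores 0—which is why sentence-level BLEU is usually smoothed and the metric is reported over whole test corpora.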
Not an Exact Science
“BLEU scores have a use, though limited,” said John Tinsley, CEO and co-founder of Iconic Translation Machines. “Even the original developers acknowledged this, but it got caught up in a wave and became something of a standard,” he told Slator.
Rico Sennrich, a post-doctoral researcher in the University of Edinburgh’s Machine Translation group and a pioneer in neural machine translation (NMT), said BLEU is an essential tool for machine translation R&D, particularly for its “very quick feedback on the quality of an experimental system.” For all this, he said, BLEU has several blind spots.
There is the fundamental fact that language is not an exact science. “There are typically many valid ways to translate the same text, so even a perfect translation may get penalized by BLEU,” Sennrich said.
He also cautioned that BLEU is “poor at measuring the overall grammaticality of a sentence, and may only give a small penalty for a change that is superficially small, but completely changes the meaning of a translation.”
[BLEU] may only give a small penalty for a change that is superficially small, but completely changes the meaning of a translation—Rico Sennrich
Diego Bartolome, CEO of tauyou language technology, cited another weak spot: BLEU is “not really useful” for post-editing. He said, “For that, we prefer the post-editing distance, [which] doesn’t capture the cognitive effort of a change, but it’s better than BLEU.”
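Post-editing distance is commonly computed as an edit distance between the raw MT output and its human-corrected version. The sketch below is one plausible word-level formulation (Levenshtein distance normalized by the post-edited length); actual tools differ in tokenization and normalization, so treat the details as assumptions.

```python
def post_edit_distance(mt_output, post_edited):
    """Word-level Levenshtein distance between raw MT output and its
    post-edited version, normalized by the post-edited length.
    One common formulation; real tools vary in the details."""
    a, b = mt_output.split(), post_edited.split()
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, 1):
        curr = [i]
        for j, word_b in enumerate(b, 1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(b), 1)
```

A score of 0 means the post-editor changed nothing; changing one word in a four-word sentence yields 0.25. As Bartolome notes, this counts edits but not the cognitive effort behind them.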
According to Bartolome, BLEU scores also depend on the type of test you run. “If it’s related to the content used to create the engine, you might get an artificially high value.” If it’s uncorrelated, on the other hand, you might get lower-than-expected results, he said.
In the Ballpark
Another problem with BLEU, Iconic CEO Tinsley added, lies in what the scores actually mean. In practice, Tinsley explained, “We use BLEU scores to give us an intuitive feel of where an engine is at the beginning, and then to benchmark different versions of the engine as we carry out ongoing development. Ultimately, though, this all needs to be backed up with manual evaluation of samples of the test data to ensure that the improvements are actually having an impact.”
According to Tinsley, a BLEU score offers more of an intuitive rather than an absolute meaning and is best used for relative judgments: “If we get a BLEU score of 35 (out of 100), it seems okay, but it actually has no correlation to the quality of the output in any meaningful sense. If it’s less than 15, we can probably safely say it’s very bad. If it’s greater than 60, we probably have some mistake in our testing! So it will generally fall in there.”
In another hypothetical example, Tinsley compared two systems, one with a BLEU score of 32 (A) and the other with 33 (B). B is nominally better, but in reality, Tinsley said, with such a small increment between the two, it would be very difficult to see a substantial improvement in quality. “Only if the BLEU score is maybe 4 or 5 points [apart] can we safely say there is a meaningful improvement,” he said.
A BLEU score offers more of an intuitive rather than an absolute meaning and is best used for relative judgments—John Tinsley, CEO of Iconic Translation Machines
For Omniscien Technologies CEO Andrew Rufener, BLEU “really doesn’t mean much” to a language service provider using MT. BLEU is useful, Rufener said, provided “you understand it and look at it in context.” Still, he concluded, BLEU scores “will tell you very little about the level of productivity enhancements you will get.”
Neural’s Advent May Spell Trouble for BLEU
As alternative metrics, Tinsley named METEOR, TER (Translation Edit Rate), and GTM (General Text Matcher). According to Tinsley, these have proven more effective for specific tasks (e.g., TER correlates better with post-editing effort). He said, “Most commercial MT providers will use all of these metrics, and maybe more when developing internally to get the full picture.”
Among these other metrics could be TAUS’ DQF (Dynamic Quality Framework), which offers bespoke benchmarking, albeit at a price. But no matter how bespoke the framework, it is hard to dispute Tinsley’s point: “There is obviously no substitute for manual evaluations.”
As Rico Sennrich said, “Human evaluations in the past have shown that BLEU systematically underestimates the quality of some translation systems, in particular, rule-based systems.”
As for the use of BLEU scores in the context of NMT, John Tinsley finds it “funny,” especially since, according to him, “they are even less effective for NMT” compared to rule-based systems.
Apart from BLEU, Tinsley said there is a trend among NMT researchers to use character-based metrics (e.g., ChrF, char-TER), “particularly when translating from English into other languages with richer morphology.” The results correlate better, he said.
The main reason for this, according to Tinsley, is that state-of-the-art NMT engines produce translations based on sub-words or characters, as opposed to the words that statistical MT engines operate on.
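To make the contrast concrete, here is a minimal sketch of a character n-gram F-score in the spirit of ChrF. It is a simplification: the published metric averages over several n-gram sizes (typically 1–6), whereas this version uses a single order, so the constants and defaults here are illustrative assumptions.

```python
from collections import Counter

def char_ngram_fscore(candidate, reference, n=3, beta=2.0):
    """Character n-gram F-score in the spirit of ChrF: precision and
    recall over character n-grams, with recall weighted by beta.
    Simplified: the published metric averages over n = 1..6."""
    def grams(text):
        chars = text.replace(" ", "")  # compare characters, not words
        return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```

Because the units are characters rather than words, a morphological variant (say, a different inflectional ending) still shares most of its n-grams with the reference instead of counting as a complete miss, which is why such metrics track quality better for morphologically rich target languages.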
The fundamental comparison point for BLEU is the word…so it’s biased toward statistical engines—Will Lewis, Microsoft Translator team
Microsoft Translator’s Lewis concurred in his interview with Chris Wendt: “The kind of fundamental comparison point for BLEU is the word—and so it’s biased toward statistical engines. Phrase-based engines tend to fare better in BLEU than rule-based engines. So if you have a score of a certain number on a rule-based engine, comparing it against a phrase-based engine isn’t exactly fair because the rule-based engine will be penalized.”
Lewis added: “Typically, if you have multiple [human translation] references, the BLEU score tends to be higher. So if you hear a very large BLEU score—someone gives you a value that seems very high—you can ask them if there are multiple references being used; because, then, that is the reason that the score is actually higher.”
In short, do not take BLEU scores at face value.
Said Sennrich, “The scientific community regularly organizes shared translation tasks, which include a human evaluation.” He concluded, “Even though such an evaluation is costly, it is crucial that we do not trust automatic metrics blindly.”