December 2, 2019
BLEU Has Measured Machine Translation Quality Since 2002. It’s Fast Becoming Useless
In the world of machine translation (MT), human evaluation remains the de facto gold standard for assessing translation quality. But for researchers and developers cycling through hundreds of MT system iterations, human evaluation is simply too slow and too expensive to use for each incremental tweak. The solution: automated metrics, which enable researchers to compute a numerical score that expresses translation quality.
Since its introduction in 2002, the Bilingual Evaluation Understudy, better known as BLEU, has become MT's most widely used metric, inspiring a number of spin-offs such as METEOR and ROUGE. BLEU and other precision-based metrics operate by comparing MT output against reference translations.
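BLEU's core recipe is simple enough to sketch in a few lines: clipped n-gram precision against a reference, combined via a geometric mean and a brevity penalty. The sketch below is illustrative only; the real metric, as implemented for example in Matt Post's sacrebleu, also handles tokenization, multiple references, and smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # "Clipping": an n-gram earns credit at most as often as it
        # appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(overlap / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:  # any zero collapses the geometric mean
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # The brevity penalty discourages very short, high-precision output.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * geo_mean

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

The cheapness of this computation (no model, no human in the loop) is exactly why BLEU became the default tool for comparing system iterations.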
According to a September 2018 paper by Johns Hopkins University research scientist Matt Post, BLEU became the dominant metric in MT research due to its (relative) language independence, its ease of computation, and its reasonable correlation with human judgment.
BLEU’s correlation with human judgment, however, has recently been called into question.
As University of Zurich PhD candidate Mathias Müller explained to Slator in an interview, confidence in BLEU scores has been shaken by significant gains in the quality of MT systems.
“There are now top-performing systems, which are rated as the best translations by humans, but these systems don’t have the best BLEU scores,” Müller said.
This trend became apparent in the annual news translation task at the 2019 Conference on Machine Translation (WMT19), though only for certain language pairs: Chinese into English, English into German, German into English, and Russian into English.
“The best systems at WMT19 made BLEU redundant” — Mathias Müller, PhD Candidate, University of Zurich
“Compared to some other language pairs in the WMT translation task, those are relatively well-resourced language directions, all of them, compared to, for example, Lithuanian–English, Kazakh–English, so there’s more data,” Müller said, noting that certain language pairs, such as English–German, have also been included in the news translation task for a number of years, while some others were added more recently.
Still, based on these results, Müller was confident enough to state at an MT meetup in Zurich in October 2019 that “the best systems at WMT19 made BLEU ‘redundant’.”
Müller is not the only expert who finds BLEU lacking. Marcin Junczys-Dowmunt, a Principal NLP Scientist with Microsoft’s Machine Translation team, describes himself as a “power user” of BLEU, who uses the metric to decide how to change models. Junczys-Dowmunt told Slator that over the past two years, trusting BLEU blindly when working with high-quality systems has become problematic.
He said, “This is a situation which I think was mostly seen by us big MT providers in industry, [and] to a lesser degree in academia.” Junczys-Dowmunt explained that industry powerhouses like Microsoft tend to build and try to maintain high-quality systems over a number of years, while academic researchers generally build less robust, ad hoc systems for experiments that focus on specific phenomena.
Ondřej Bojar, a machine translation researcher at Charles University in Prague and longtime co-organizer of the WMT shared tasks, sees the same cracks. “Anyone who has successfully published a paper on machine translation has some grasp of the problems,” Bojar told Slator.
WMT19’s news translation task highlighted BLEU’s limitations as related to MT system quality, but some of BLEU’s shortcomings are inherent to the metric itself.
Precision-based metrics like BLEU give credit only for the parts of the translation output they can confirm against a reference translation; correct output that happens not to appear in the reference earns no credit at all. This means that potentially substantial portions of MT output (Bojar and his colleagues have estimated up to one-third) go unscored and unaccounted for.
BLEU is also highly sensitive to surface word forms, which leads it to penalize correct translations in morphologically rich languages, where word endings change depending on grammatical case.
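That sensitivity is easy to demonstrate with a toy exact-match count (not the full BLEU formula): a German phrase rendered in a different grammatical case shares almost no surface forms with the reference, even though the underlying words are the same.

```python
def unigram_matches(hypothesis, reference):
    """Count hypothesis tokens that appear verbatim in the reference."""
    ref_tokens = reference.split()
    return sum(token in ref_tokens for token in hypothesis.split())

# Dative vs. nominative: same lemmas, different endings, so a
# word-level exact-match metric credits only the unchanged noun.
print(unigram_matches("mit dem alten Haus", "das alte Haus"))  # 1 ("Haus")
```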
Up until now, Bojar said, “these problems, collectively, were still not big enough that the community would completely refuse BLEU.” But MT may soon reach the point where improved quality calls for a new metric.
Junczys-Dowmunt points to two factors in particular that have led to sudden significant gains in MT quality: the arrival of the Transformer and the exponential increase in scale.
“We have seen this interesting investment from BERT, direct lessons [on] how to build better systems, from large-scale MT models,” he said. “This has already been seen in this year’s WMT results. The large-scale models did vastly better than the competition.”
Experts have been designing alternate metrics for years, but none has quite caught on like BLEU — yet. Why the enduring commitment to BLEU and its variants?
Müller noted that the weakening correlation with human judgment observed for certain language pairs in the WMT19 news translation task is not universal across all language pairs, or across all metrics. Moreover, not all research is focused on developing top-performing systems, so in many research scenarios the correlation between BLEU scores and human judgment remains acceptable.
The usability of new metrics can also be a hindrance.
“There’s always a trade-off. BLEU is very easy to compute [and one can] get a result in a couple of milliseconds,” Müller explained. “If a metric is better than BLEU but very cumbersome to use, then people might shy away from this metric.”
“This perspective of BLEU becoming useless is actually not that scary. This is happening because of increased quality” — Marcin Junczys-Dowmunt, Principal NLP Scientist with Microsoft’s Machine Translation Team
Newer approaches include contrastive evaluation, which Müller has been exploring. Contrastive evaluation probes how an MT system handles specific linguistic phenomena, such as pronouns or noun-verb agreement, and is meant to complement BLEU and other metrics that give an overall impression of translation quality.
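In rough terms, a contrastive test set pairs each correct translation with a minimally different incorrect variant and checks how often the model prefers the correct one. The sketch below uses hard-coded stand-in scores purely so it runs end to end; a real evaluation, such as Müller's pronoun test suites, would use the MT model's log-probability of each sentence given the source.

```python
def contrastive_accuracy(model_score, examples):
    """Fraction of examples where the model scores the correct
    translation above its minimally different incorrect variant."""
    wins = sum(
        model_score(src, good) > model_score(src, bad)
        for src, good, bad in examples
    )
    return wins / len(examples)

# Stand-in scores, NOT a real model: a real setup would query the MT
# system's log-probability of the target sentence given the source.
toy_logprob = {
    "Die Tasse zerbrach, weil sie zerbrechlich war.": -12.3,
    "Die Tasse zerbrach, weil er zerbrechlich war.": -15.8,
}

def toy_score(src, tgt):
    return toy_logprob[tgt]

examples = [
    # (source, correct pronoun "sie" for feminine "Tasse", wrong "er")
    ("The cup broke because it was fragile.",
     "Die Tasse zerbrach, weil sie zerbrechlich war.",
     "Die Tasse zerbrach, weil er zerbrechlich war."),
]
print(contrastive_accuracy(toy_score, examples))  # 1.0
```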
chrF, in Bojar's mind, is about as simple as BLEU and works around some of its limitations. Like BLEU, chrF works sentence by sentence and compares MT output to reference translations, but it matches character sequences rather than whole words, which helps it recognize different forms of the same word.
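The character-level idea can be sketched as follows. This is a simplification of the published chrF metric (which, among other details, averages F-scores per n-gram order), but it shows why a differently inflected word still earns substantial credit instead of none.

```python
from collections import Counter

def char_ngrams(text, n):
    s = text.replace(" ", "")  # simplification: ignore spaces
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf_sketch(hypothesis, reference, max_n=6, beta=2.0):
    """Rough chrF: character n-gram precision and recall averaged over
    n = 1..max_n, combined into a recall-weighted F-score (beta = 2)."""
    precs, recs = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precs.append(overlap / max(sum(hyp.values()), 1))
        recs.append(overlap / max(sum(ref.values()), 1))
    p, r = sum(precs) / max_n, sum(recs) / max_n
    return 0.0 if p + r == 0 else (1 + beta**2) * p * r / (beta**2 * p + r)

# "Hauses" vs. "Haus": zero credit for the word at the word level,
# but most character n-grams still overlap, so the score stays high.
print(round(chrf_sketch("des Hauses", "des Haus"), 2))
```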
Like other experts, though, Bojar thinks the next groundbreaking metric will need to operate at the document level. “For BLEU to be definitely abandoned, [we] need some document evaluation,” he said.
No one can say for sure when BLEU might be replaced, by what, or whether it will be replaced by another metric at all.
“This perspective of BLEU becoming useless is actually not that scary. This is happening because of increased quality,” Junczys-Dowmunt said. “It might just be that our systems are becoming so good that we don’t need to measure them automatically.”