For years, machine translation (MT) experts and aficionados have agreed that the quality of MT is improving, with some (not unbiased) sources suggesting that certain systems now approach human parity.
A new study from the National Institute of Information and Communications Technology in Kyoto, Japan, calls those claims into question by examining MT quality evaluation standards at scale.
Authors Benjamin Marie, Atsushi Fujita, and Raphael Rubino manually annotated 769 research papers sourced from the Association for Computational Linguistics (ACL) Anthology. The papers, published between 2010 and 2020, included the keywords “MT” or “translation” in the title and compared at least two MT systems.
The resulting paper, Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers, identified several major issues with MT evaluations. One pitfall in particular, the reliance on BLEU over all other automatic metrics, is arguably a root cause of the other weak points.
The widespread use of BLEU in MT evaluation was no surprise, but the researchers were struck by its hegemony: 98.8% of the annotated papers reported BLEU scores, and 74.3% used only BLEU. The proportion of papers relying exclusively on BLEU also appears to be increasing, as papers published earlier in the decade were more likely to also include TER or METEOR scores.
Ironically, there is no shortage of alternatives to BLEU, some of which the authors describe as easier to use, more reproducible, and more aligned with human judgment. At least 108 new metrics were introduced between 2010 and 2020, but 89% of them have never been used in any Anthology paper beyond the one that proposed them.
“We assume that MT researchers largely ignore new metrics in their research papers for the sake of some comparability with previous work or simply because differences between BLEU scores may seem more meaningful or easier to interpret than differences between scores of a rarely used metric,” the researchers explained.
They added, “BLEU may also be directly requested by reviewers, or even worse, other metrics may be requested to be dropped.”
Reporting only BLEU scores feeds directly into another issue: comparing scores copied from other papers without reproducing the results. Comparing copied scores saves time and money, and it seems the temptation has become harder to resist. While copying scores (mostly BLEU) from previous work was rare before 2015, 40% of papers released in 2019 and 2020 did so.
SacreBLEU as Band-Aid
The problem is that BLEU and other automatic metrics depend on several parameters, such as tokenization and smoothing, so comparing copied scores computed under different settings can be misleading. SacreBLEU, introduced in 2018, was the only tool used in the annotated papers to standardize metric parameters and make scores comparable. At best, the tool is a band-aid for questionable practices, and the authors found several instances in which it was misused.
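For context, the sketch below shows how a metric score can be reported together with its SacreBLEU “signature,” which records the tokenization, casing, smoothing, number of references, and tool version behind the number. It is a minimal illustration assuming the Python API of SacreBLEU 2.x; the toy sentences are invented.

```python
# Minimal sketch: reporting a BLEU score together with its SacreBLEU signature.
# Assumes the SacreBLEU 2.x Python API (pip install sacrebleu); data is invented.
from sacrebleu.metrics import BLEU

hypotheses = ["The cat sat on the mat.", "He reads books every day."]             # system output
references = [["The cat is sitting on the mat.", "He reads a book every day."]]   # one reference set

bleu = BLEU()
result = bleu.corpus_score(hypotheses, references)

print(result)                # corpus-level BLEU score
print(bleu.get_signature())  # tokenization, casing, smoothing, #refs, version
```

Publishing the signature alongside the score is what makes two BLEU numbers comparable in the first place; copying a bare score from another paper discards exactly that information.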
Comparing copied scores also seems to lead researchers to skip statistical significance testing and instead draw conclusions without checking whether score differences could simply be due to chance. At its peak, 65% of papers in a single year performed such testing, and the meta-evaluation showed a sharp decline in statistical significance testing since 2016. (The same trend has been observed in NLP papers more generally.)
Papers have instead mainly labeled results “significant” based on the amplitude of the difference between metric scores, even though statistical significance is independent of amplitude. More to the point, the authors wrote, there is no scientific consensus on what a given difference between BLEU scores means.
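One standard check, paired bootstrap resampling, repeatedly resamples the test set and asks how often one system actually outscores the other. The sketch below is a generic illustration of that idea, not the authors’ protocol; the function and variable names are hypothetical.

```python
# Generic sketch of paired bootstrap resampling for comparing two MT systems.
# Not the paper's protocol; names are hypothetical. `metric` is any
# corpus-level scoring function, e.g. lambda h, r: corpus_bleu(h, [r]).score.
import random

def paired_bootstrap(hyps_a, hyps_b, refs, metric, n_samples=1000, seed=0):
    """Return the fraction of resampled test sets on which system A beats B."""
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        # Draw a pseudo test set of the same size, sampling segments with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = metric([hyps_a[i] for i in idx], [refs[i] for i in idx])
        score_b = metric([hyps_b[i] for i in idx], [refs[i] for i in idx])
        if score_a > score_b:
            wins_a += 1
    return wins_a / n_samples
```

If system A wins on, say, 95% or more of the resampled sets, the observed improvement is unlikely to be an accident of the particular test set, regardless of how large or small the gap in the headline scores looks.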
Datasets in MT papers that copy and compare metric scores must be exactly identical; otherwise, researchers cannot pinpoint whether a change in scores stems from the new method, the change in datasets, or a combination of the two.
Questionable Evaluation Practices
The meta-evaluation also found that an increasing proportion of MT papers (38.5% in 2019–2020) drew conclusions about the superiority of a particular method or algorithm while comparing results obtained on different data, claims that the evaluations could not support. Few papers made use of pre-processed, publicly released MT datasets.
“An increasing number of publications accumulate questionable evaluation practices,” the authors concluded, noting that the pitfalls are “relatively easy” to avoid and have all been described in previous work.
Their proposed solution is a “clear, simple, and well-promoted” guideline for authors to follow, plus a scoring method for reviewers.
While the proposed guideline and scoring method do not cover every issue in MT evaluation (such as the field’s reliance on the same language pairs, especially those including English), the researchers believe they will lead to better, if not flawless, evaluation.