And the Winner Is… ACL 2020 Announces Best Paper Awards

Paper making the case for retiring machine translation quality standard BLEU takes an Honorable Mention at ACL 2020

The Annual Meeting of the Association for Computational Linguistics (ACL) announced the Best Paper Awards of 2020 via Twitter on July 8.

ACL also organizes EMNLP, another of the world’s largest natural language processing conferences. ACL’s Annual Meeting covers a range of research areas in computational approaches to natural language, drawing thousands of research papers from across the globe. And like most conferences in 2020, ACL went remote.

The ACL 2020 conference committee narrowed down this year’s 3,088 submissions, accepting 779 papers (571 long, 208 short) for an acceptance rate of 25.2%.

Best Overall Paper went to Beyond Accuracy: Behavioral Testing of NLP Models with CheckList by Marco Tulio Ribeiro of Microsoft Research; Tongshuang Wu and Carlos Guestrin of the University of Washington; and Sameer Singh of the University of California, Irvine.

One of the two Honorable Mentions for Best Overall Paper went to a paper that, in the words of its authors, “adds to the case for retiring BLEU as the de facto standard metric.” This sentiment echoes that of other experts who, as previously mentioned, believe that BLEU is fast becoming useless.

In fact, although BLEU is still widely used in academia, its reliance on reference texts makes it impractical for industry use, as NVIDIA Senior Deep Learning Engineer Chip Huyen told the SlatorCon audience last fall.

In the Honorable Mention paper, Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics, University of Melbourne researchers Nitika Mathur, Timothy Baldwin, and Trevor Cohn show that “current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric’s efficacy.”

The Unimelb trio found that outlier systems (that is, systems whose quality is much higher or lower than the rest) can have “a disproportionate effect on the computed correlation of metrics,” such that “the resulting high values of correlation can […] lead to false confidence in the reliability of metrics.”

When the outliers are removed, they said, “the gap between correlation of BLEU and other ‘more powerful’ metrics (e.g., CHRF, YISI-1, and ESIM) becomes wider. In the worst case scenario, outliers introduce a high correlation when there is no association between metric and human scores for the rest of the systems. Thus, future evaluations should also measure correlations after removing outlier systems.”
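The effect the researchers describe can be illustrated with toy numbers. The short Python sketch below uses invented scores (not data from the paper) to compute the system-level Pearson correlation between a hypothetical automatic metric and human judgments for a handful of closely matched MT systems, then adds a single much weaker outlier system: the correlation jumps from near zero to near perfect, even though the metric says almost nothing about the competitive systems.

```python
# Illustrative sketch only: invented toy numbers, not data from Mathur et al. (2020).
# Demonstrates how one outlier system can inflate the system-level Pearson
# correlation between an automatic metric (e.g., BLEU) and human judgments.
from statistics import correlation  # Pearson's r; requires Python 3.10+

# Hypothetical system-level scores for five closely matched MT systems.
human_scores  = [0.71, 0.69, 0.73, 0.70, 0.72]   # human evaluation scores
metric_scores = [34.5, 34.8, 34.2, 33.9, 34.6]   # automatic metric scores

# One outlier system that is far worse than the rest on both scales.
human_with_outlier  = human_scores  + [0.20]
metric_with_outlier = metric_scores + [12.0]

print("r without outlier:", round(correlation(human_scores, metric_scores), 3))              # ≈ -0.22
print("r with outlier:   ", round(correlation(human_with_outlier, metric_with_outlier), 3))  # ≈ 1.00
# The near-perfect correlation in the second case is driven entirely by the
# outlier; among the remaining systems the metric carries almost no signal.
```

This is the scenario the authors warn about, and why they recommend reporting correlations both with and without outlier systems.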

It is time to retire BLEU, the authors concluded, and to instead use metrics such as CHRF, YISI-1, or ESIM because “they are more powerful in assessing empirical improvements.”

They end by saying that “human evaluation must always be the gold standard, and for continuing improvement in translation, to establish significant improvements over prior work, all automatic metrics make for inadequate substitutes.”

The other Honorable Mention for Best Overall Paper was Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks by researchers from the Allen Institute for Artificial Intelligence and the University of Washington.

After holding its 57th annual meeting in Florence, Italy, last year, ACL originally planned to hold this year’s conference in Seattle, Washington, but moved online due to the Covid-19 pandemic. ACL’s Twitter page (@aclmeeting) was abuzz on the days of the virtual conference, July 5–10, 2020, with live tweeters providing updates in many languages.