How to Improve Automatic Machine Translation Evaluation? Add Humans, Scientists Say

Allen Institute Proposes Novel Way to Improve Machine Translation Evaluation

A group of researchers has developed a leaderboard to automate the quality evaluation of natural language processing (NLP) programs, including machine translation (MT). The leaderboard, known as GENIE, was introduced in a January 17, 2021 paper on the preprint server arXiv.org.

A leaderboard records automatically computed evaluation metrics of NLP programs. Over time, a leaderboard can help researchers compare apples to apples by standardizing comparisons of newer NLP programs with previous state-of-the-art approaches.

Automatic evaluation of MT is notoriously challenging due to the wide range of possible correct translations. Existing automatic metrics, in particular BLEU (for MT) and ROUGE (for summarization), fall short by diverging significantly from human judgments; tuning MT models to maximize BLEU scores has even been linked to biased translations.
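
To see why a string-overlap metric can diverge from human judgment, consider the minimal Python sketch below (the sentences are invented for illustration, and the widely used sacrebleu package is assumed): a correct paraphrase that shares few n-grams with the reference scores far lower than a near-verbatim copy, even though a human would accept both.

```python
# Minimal illustration (not from the GENIE paper): BLEU rewards n-gram
# overlap with a reference, so a valid paraphrase can score poorly.
# Requires: pip install sacrebleu
import sacrebleu

reference = ["The meeting was postponed until next week."]

# Hypothesis A: near-verbatim copy of the reference.
# Hypothesis B: a correct paraphrase with little n-gram overlap.
hyp_a = ["The meeting was postponed until next week."]
hyp_b = ["They pushed the meeting back to the following week."]

bleu_a = sacrebleu.corpus_bleu(hyp_a, [reference])
bleu_b = sacrebleu.corpus_bleu(hyp_b, [reference])

print(f"BLEU (verbatim):   {bleu_a.score:.1f}")  # ~100
print(f"BLEU (paraphrase): {bleu_b.score:.1f}")  # far lower, despite being correct
```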

More generally, as MT quality has improved and differences between outputs have become more nuanced, these automatic metrics have struggled to keep pace with more sophisticated MT models (SlatorPro).

It follows, then, that academics and tech companies alike are searching for a more efficient, standardized method of human evaluation. (For example, Facebook patented a method for gathering user engagement data to rate MT in 2019.)

The researchers behind GENIE believe they are on the right path. The group comprises Daniel Khashabi, Jonathan Bragg, and Nicholas Lourie of Allen Institute for AI (AI2); Gabriel Stanovsky from Hebrew University of Jerusalem; Jungo Kasai from University of Washington; and Yejin Choi, Noah A. Smith, and Daniel S. Weld, who are affiliated with both AI2 and the University of Washington.

“We must actively rethink the evaluation of AI systems and move the goalposts according to the latest developments,” Khashabi wrote on his personal website, explaining that GENIE was built to present “more comprehensive challenges for our latest technology.”

Dynamic Crowdsourcing

GENIE is billed as offering “human-in-the-loop” evaluation, which it provides via crowdsourcing. The process begins when a researcher makes a leaderboard submission to GENIE, which then automatically crowdsources human evaluation from Amazon Mechanical Turk.

Once human evaluation is complete, GENIE ranks the model relative to previous submissions. Users can view and compare models’ performance either in a task-specific leaderboard or in a meta leaderboard that summarizes statistics from individual leaderboards.
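
The article does not show GENIE's actual interface, so the following hypothetical Python sketch only illustrates the general human-in-the-loop pattern described above: collect crowd ratings per submission, aggregate them, and rank a new entry against earlier ones. All names here (CrowdRating, rank_submissions, the submission IDs) are invented for illustration.

```python
# Hypothetical sketch of the human-in-the-loop ranking step described above.
# This is NOT GENIE's actual API; all names are invented for illustration.
from dataclasses import dataclass
from statistics import mean

@dataclass
class CrowdRating:
    submission_id: str   # which leaderboard submission was rated
    worker_id: str       # anonymous crowd worker (e.g., on Mechanical Turk)
    score: float         # quality judgment, e.g., on a 1-5 scale

def rank_submissions(ratings: list[CrowdRating]) -> list[tuple[str, float]]:
    """Average the crowd scores per submission and sort best-first."""
    by_submission: dict[str, list[float]] = {}
    for r in ratings:
        by_submission.setdefault(r.submission_id, []).append(r.score)
    averaged = [(sid, mean(scores)) for sid, scores in by_submission.items()]
    return sorted(averaged, key=lambda pair: pair[1], reverse=True)

# Example: a new submission ("mt-v2") is ranked against an earlier one ("mt-v1").
ratings = [
    CrowdRating("mt-v1", "w1", 3.0), CrowdRating("mt-v1", "w2", 3.5),
    CrowdRating("mt-v2", "w1", 4.5), CrowdRating("mt-v2", "w2", 4.0),
]
print(rank_submissions(ratings))  # [('mt-v2', 4.25), ('mt-v1', 3.25)]
```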

In addition to MT, there are currently three other task-specific leaderboards: question answering, commonsense reasoning, and summarization.

The authors encourage researchers and developers to submit new text generation models for evaluation. According to a VentureBeat article on GENIE, the plan is to cap submission fees at USD 100, with initial submissions paid for by academic groups. After that, other options may come into play, such as a sliding scale whereby payments from tech companies help subsidize the cost for smaller organizations.

“Even upon any potential updates to the cost model, our effort will be to keep the entry barrier as minimal as possible, particularly to those submissions coming from academia,” the authors wrote.

Reporting a Gold Standard

Of course, GENIE has a ways to go before becoming ubiquitous in NLP. The authors acknowledge that their system will require “substantial effort in training annotators and designing crowdsourcing interfaces,” not to mention the costs associated with each.

Procedures for quality assurance of human evaluation have also yet to be finalized. In particular, the researchers note that human evaluations are “inevitably noisy,” so studying the variability in human evaluations is a must.

Another concern is the reproducibility of human annotations over time and across individuals. The authors suggest estimating annotator variance and spreading annotations over several days to make human annotations more reproducible.
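
The article does not spell out how that variance would be estimated; as a rough illustration only, the sketch below bootstraps over annotators to gauge how much a system's average human score shifts depending on who happened to rate it. The ratings matrix and function names are invented, not taken from the GENIE paper.

```python
# Illustrative only: bootstrap over annotators to gauge how "noisy" a
# system's average human score is. Data and names are invented examples.
import numpy as np

rng = np.random.default_rng(0)

# Rows = annotators, columns = evaluated outputs; entries are 1-5 ratings.
ratings = np.array([
    [4, 3, 5, 4, 2],
    [3, 3, 4, 4, 3],
    [5, 4, 5, 3, 2],
])

def bootstrap_mean_ci(scores: np.ndarray, n_boot: int = 10_000) -> tuple[float, float]:
    """Resample annotators with replacement and return a 95% interval
    for the system's mean rating."""
    n_annotators = scores.shape[0]
    means = []
    for _ in range(n_boot):
        sample = scores[rng.integers(0, n_annotators, size=n_annotators)]
        means.append(sample.mean())
    return float(np.percentile(means, 2.5)), float(np.percentile(means, 97.5))

low, high = bootstrap_mean_ci(ratings)
print(f"Mean rating: {ratings.mean():.2f}, 95% CI over annotators: [{low:.2f}, {high:.2f}]")
```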

Besides standardizing high-quality human evaluation of NLP systems, GENIE aims to free up model developers’ time; instead of designing and running evaluation programs, they can focus on what they do best. As a “central, updating hub,” GENIE is meant to facilitate an easy submission process with the ultimate goal of encouraging researchers to report their findings.