Quality estimation (QE) for machine translation (MT) is less accurate than commonly thought, according to a paper published by Facebook researchers on July 6, 2020.
Among its many useful applications, QE can be trained to automatically identify and filter out bad translations, which can reduce costs and human post-editing effort. End-users who cannot read the source language can also use QE as a feedback mechanism. In July 2019, researchers at Unbabel published a paper on post-editing styles and said they were looking to apply their work to, among other things, quality estimation.
The paper, “Are we Estimating or Guesstimating Translation Quality?,” is based on research carried out during PhD student Shuo Sun’s internship with Facebook over the summer of 2019. Francisco Guzman, a research scientist manager with Facebook’s Language and Translation Technologies (LATTE) group, supervised the research, along with Imperial College London professor and fellow LATTE research collaborator Lucia Specia.
The authors identified three main reasons QE performance is overstated: (1) a lack of balance between high- and low-quality instances; (2) limited lexical variety in test sets; and (3) a lack of robustness to partial input.
They believe these issues with QE datasets lead to “guesstimates” of translation quality, rather than estimates. The findings were a surprise to Sun, whose project originally aimed to examine whether multi-task learning can be used to learn better neural QE models.
“I ‘accidentally’ discovered the problems because of a bug in my code,” Sun told Slator. “My QE neural models performed well when they were not properly ingesting source sentences.”
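Such a failure can be caught with a simple partial-input check: score each translation once with the full source–target pair and once with the source blanked out, then compare how well each set of scores correlates with human judgments. The Python sketch below illustrates the idea, assuming a hypothetical `qe_model.predict(source, target)` scorer; it is not code from the paper.

```python
# Sketch of a partial-input sanity check for a sentence-level QE model.
# Assumes a hypothetical `qe_model.predict(source, target) -> float` scorer
# and parallel lists of source sentences, MT outputs, and human quality labels.
from scipy.stats import pearsonr

def partial_input_check(qe_model, sources, targets, human_scores):
    # Score each pair with the full input.
    full = [qe_model.predict(src, tgt) for src, tgt in zip(sources, targets)]
    # Score again with the source blanked out, so the model sees only the MT output.
    target_only = [qe_model.predict("", tgt) for tgt in targets]

    r_full, _ = pearsonr(full, human_scores)
    r_partial, _ = pearsonr(target_only, human_scores)

    # If the gap is small, the model is likely ignoring the source sentence.
    print(f"Pearson r (source + target): {r_full:.3f}")
    print(f"Pearson r (target only):     {r_partial:.3f}")
    return r_full - r_partial
```

A robust QE model should see its correlation with human scores drop sharply when the source is withheld; if it does not, the model is guessing from the target alone.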
How to Deal With Overstated QE
The research team found that QE datasets tended to be unbalanced, often excluding translated sentences with low quality scores. As a result, most of the translated sentences required little to no post-editing.
“This defeats the purpose of QE, especially when the objective of QE is to identify unsatisfactory translations,” the authors wrote. To combat this imbalance, the researchers recommended purposefully designing datasets to include translations of varying quality.
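One straightforward way to do this is to bin translations by their quality scores and cap how many examples each band contributes, as in the sketch below. The record format, score scale, and bin settings are illustrative assumptions, not the procedure used in the paper.

```python
# Sketch of one way to rebalance a QE dataset across quality levels.
# Assumes hypothetical records of the form (source, mt_output, quality_score),
# with quality scores on a 0-100 scale (e.g., direct assessment).
import random
from collections import defaultdict

def balance_by_quality(records, num_bins=5, per_bin=2000, seed=0):
    rng = random.Random(seed)
    bins = defaultdict(list)
    for rec in records:
        score = rec[2]
        # Map the score to one of `num_bins` equal-width quality bands.
        idx = min(int(score / (100 / num_bins)), num_bins - 1)
        bins[idx].append(rec)

    balanced = []
    for idx in range(num_bins):
        pool = bins[idx]
        rng.shuffle(pool)
        # Keep at most `per_bin` examples per band so low-quality
        # translations are not drowned out by near-perfect ones.
        balanced.extend(pool[:per_bin])
    rng.shuffle(balanced)
    return balanced
```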
Lexical artifacts (i.e., a lack of diversity across labels, sentences, and vocabulary) can also inflate a QE system’s apparent performance, since repetitive content is easier to score. Sampling source sentences from various documents across multiple domains provides a more diverse range of material.
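As a rough illustration of how such artifacts might be detected (not a method from the paper), the sketch below computes a type–token ratio for a set of sentences and the vocabulary overlap between two partitions of a test set; whitespace tokenization is a simplifying assumption.

```python
# Sketch of quick lexical-diversity checks on a QE test set.
# Assumes whitespace-tokenizable sentences; a real check would use a proper
# tokenizer and also compare vocabulary across quality labels and documents.
def type_token_ratio(sentences):
    tokens = [tok.lower() for sent in sentences for tok in sent.split()]
    # A low ratio suggests repetitive, domain-narrow content that a QE
    # model can exploit without truly estimating quality.
    return len(set(tokens)) / max(len(tokens), 1)

def vocab_overlap(sents_a, sents_b):
    vocab_a = {tok.lower() for s in sents_a for tok in s.split()}
    vocab_b = {tok.lower() for s in sents_b for tok in s.split()}
    # High overlap between, say, high- and low-score partitions can signal
    # that surface vocabulary, rather than quality, separates the labels.
    return len(vocab_a & vocab_b) / max(len(vocab_a | vocab_b), 1)
```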
The authors also suggested “using a metric that intrinsically represents both fluency and adequacy as labels” when designing and annotating QE datasets.
QE Dataset by Researchers, for Researchers
Building on their own recommendations, the researchers created a new QE dataset, called MLQE. They focused on six language pairs: two high-resource pairs (English–German and English–Chinese); two medium-resource pairs (Romanian–English and Estonian–English); and two low-resource pairs (Sinhala–English and Nepali–English).
For each language pair, 10,000 sentences were extracted from Wikipedia articles on a range of topics to prevent lexical artifacts. These sentences were then translated by state-of-the-art neural models. Finally, they were manually annotated with direct assessment (DA) scores in order to mitigate sampling bias and the imbalance between high- and low-quality translations.
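In WMT-style direct assessment, several annotators rate each translation on a 0–100 scale, and their scores are typically z-normalized per annotator before being averaged per segment to remove individual scoring biases. The sketch below illustrates that common aggregation scheme; it is a generic illustration, not the exact MLQE labeling pipeline.

```python
# Sketch of a common direct assessment (DA) aggregation scheme:
# z-normalize each annotator's 0-100 ratings, then average the normalized
# ratings for each segment. Illustrative only; not the exact MLQE procedure.
import statistics
from collections import defaultdict

def aggregate_da(ratings):
    """ratings: list of (annotator_id, segment_id, raw_score_0_100) tuples."""
    by_annotator = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)

    # Per-annotator mean and standard deviation for z-normalization.
    stats = {
        a: (statistics.mean(s), statistics.pstdev(s) or 1.0)
        for a, s in by_annotator.items()
    }

    by_segment = defaultdict(list)
    for annotator, segment, score in ratings:
        mean, std = stats[annotator]
        by_segment[segment].append((score - mean) / std)

    # Final label per segment: mean of its z-normalized ratings.
    return {seg: statistics.mean(zs) for seg, zs in by_segment.items()}
```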
“We decided to build an improved QE dataset for the research community the moment we discovered the issues with current QE datasets,” Sun said. MLQE is now available on GitHub and is currently being used for the WMT 2020 shared task on QE.
Sun said neural QE models seem to perform better on medium- and low-resource language directions than on high-resource language directions.
Sun’s next plans include studying QE in zero-shot and few-shot cross-lingual transfer settings and experimenting with multilingual QE models that can handle multiple language directions simultaneously.