After machine translation became neural machine translation around 2016, academia and big tech research groups began publishing papers implying machine translation had reached human-level quality (whatever that means). Simplified headlines in tech publications followed, researchers argued journalists cherry-picked quotes, and Slator tried to make sense of it all by asking the experts.
“Achieving human-level translation” is so 2018, though. In 2020, it has become “outperforming human-level translation.” That claim was made in a paper published on September 1, 2020, about CUBBITT, a new Transformer-based deep-learning system. The authors include Google’s Łukasz Kaiser, Jakob Uszkoreit of Google Brain Berlin, and Ondřej Bojar at Charles University in Prague.
Okay, let’s start with the caveats. The paper’s title only mentions reaching translation quality “comparable” to human professionals in the domain of “news” translation. And the claim of outperforming humans is reserved for the “adequacy” metric. Still, the claim of “outperforming humans” on any metric other than speed seems new.
Furthermore, the authors conceded that “highly qualified human translators with [an] infinite amount of time and resources will likely produce better translations than any MT system.”
They added, however, that “many clients cannot afford the costs of such translators and instead use services of professional translation agencies, where the translators are under certain time pressure. Our results show that the quality of professional-agency translations is not unreachable by MT, at least in certain aspects, domains, and languages.” An interesting take on the professionalism of linguists who earn their living from translation.
Sentence-level Translation Turing Test
So how did CUBBITT’s supposed outperformance come about? The study defined adequacy as “adequately expressing [the source text’s] intended meaning in the target language.” The assertion that CUBBITT outperformed human translation, therefore, means that human evaluators rated CUBBITT’s translations as representing the source text’s meaning better than the human reference translations: 52% of CUBBITT’s sentences scored higher than the human translations; 26% of CUBBITT translations were scored lower than human translations.
Using the same source documents and translations from CUBBITT’s winning performance on the WMT18 news translation task, 15 human evaluators rated the quality of almost 8,000 sentences across 53 documents. Unlike the news translation task, however, evaluators were provided document-level context for the translations. This allowed evaluators to catch errors that might not have been evident without context, such as a gender mismatch or the incorrect translation of an ambiguous expression.
Compared to the human reference translations, the authors observed that “CUBBITT made significantly fewer errors in addition of meaning, omission of meaning, shift of meaning, other adequacy errors, grammar, and spelling.” On the other hand, CUBBITT made significantly more errors due to cross-sentence context (as the researchers anticipated), and human translation was still rated as more fluent.
The group also conducted a “sentence-level translation Turing test” by showing evaluators 100 pairs of sentences, each consisting of a source sentence and a translation. Participants then identified each translation as produced either by a human or by MT. CUBBITT translations were less likely to be identified as MT than translations produced by Google Translate.
Contributing to Human-Likeness
“One potential contributor to human-likeness of CUBBITT could be the fact that it is capable of restructuring translated sentences where the English structure would sound unnatural in Czech,” the authors posited, crediting CUBBITT’s training on back-translation data.
To overcome the lack of English–Czech parallel data for training, the researchers used back-translation, translating more widely available monolingual target language data into the source language. The resulting sentence pairs comprise additional synthetic parallel training data, which are traditionally mixed together with authentic sentences in random order.
CUBBITT is “trained with back-translation data in a novel block regime (block-BT), where the training data are presented to the neural network in blocks of authentic parallel data alternated with blocks of synthetic data.”
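The difference between the traditional mixed ordering and the block regime can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy data, and the block size of 2 are all hypothetical, chosen only to show how the order in which training examples are presented differs between the two approaches.

```python
import random
from itertools import zip_longest


def mixed_regime(authentic, synthetic, seed=0):
    """Traditional ordering: authentic and synthetic pairs shuffled together."""
    data = list(authentic) + list(synthetic)
    random.Random(seed).shuffle(data)
    return data


def chunks(items, size):
    """Split a list into consecutive blocks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def block_regime(authentic, synthetic, block_size):
    """Block ordering: a block of authentic pairs, then a block of synthetic
    pairs, repeated. The block size here is an illustrative assumption."""
    ordered = []
    auth_blocks = chunks(list(authentic), block_size)
    synth_blocks = chunks(list(synthetic), block_size)
    for auth_block, synth_block in zip_longest(auth_blocks, synth_blocks, fillvalue=[]):
        ordered.extend(auth_block)
        ordered.extend(synth_block)
    return ordered


# Toy stand-ins for sentence pairs, tagged by origin (hypothetical data).
authentic = [("auth", i) for i in range(4)]
synthetic = [("synth", i) for i in range(4)]

schedule = block_regime(authentic, synthetic, block_size=2)
print([kind for kind, _ in schedule])
# ['auth', 'auth', 'synth', 'synth', 'auth', 'auth', 'synth', 'synth']
```

In a real system each block would hold a large amount of training data rather than two toy pairs; the sketch only shows the alternation pattern described in the quote.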
The authors noted that back-translation can sometimes have the inadvertent benefit of improving the fluency (and sometimes adequacy) of the final translations, “since the target side in back-translation are authentic sentences originally written in the target language.”
The English–French and English–Polish versions of CUBBITT attained BLEU results consistent with those of the English–Czech version. Document-level evaluations suggest that CUBBITT performs best on articles related to business and politics, and performs the worst on articles about art, entertainment, and sports.