As machine translation (MT) improves, it becomes more critical to find a way to measure incremental gains. There is a general consensus that existing evaluation methods do not fit the bill and, in fact, can even promote the much-maligned, simplistic “translationese” that MT tends to produce. Researchers in both academia and industry, therefore, continue to explore alternatives.
Power player Google’s recent research includes work in MT evaluation, such as extending BLEURT beyond English, and the creation of a large-scale knowledge base for 18 languages, for use without reference translations.
A March 2020 paper coauthored by researchers at some of the most prestigious MT research institutions, however, concluded that reference translations are a critical factor in evaluating MT. Now, an October 2020 Google paper explores how human-paraphrased reference translations can improve not just MT output, but MT systems as a whole.
The crux of the issue, according to the paper, is achieving more natural-sounding translation. Current evaluation metrics such as BLEU, however, often reward monotonic and simplistic output. In practice, this means that MT systems can “cheat” by producing the simplest possible translation, not necessarily the highest-quality one, to yield the highest BLEU score.
Lead researcher Markus Freitag told Slator that rather than fix the metric, as many researchers have attempted before, the team at Google decided to look at the problem from a different angle and to try to remove the translation bias.
“Before we can actually improve the MT system, we need to get a better evaluation of translation quality, both automated and human,” Freitag said. “Once we’re happy with the automated evaluation, we want to improve the underlying MT system, which is actually really difficult to improve.”
In Other Words…
Research has shown that when humans paraphrase standard references for use in automated MT evaluation, the automated evaluation correlates better with human judgment. Most notably, using paraphrased references sidesteps the automated evaluation’s preference for monotonic translations that contain the same words as the reference, resulting in a fairer assessment of alternative, equally good translations.
Without providing the source sentences, researchers asked professional linguists to paraphrase reference translations as much as possible, which included using different wording and sentence structures, while keeping the reference a natural instance of the target language.
A second group of professional human translators was asked to rate the reference translations, both paraphrased and not, in side-by-side evaluations, again without the source sentences. The vast majority of human translators preferred the paraphrased reference translations, indicating that they were of a higher quality than the standard references.
The researchers then evaluated, step by step, the design choices behind the best-performing English–German system from WMT 2019, with the goal of determining their impact on standard reference BLEU versus their impact on paraphrased BLEU (referred to in the paper as BLEUP). Those steps included data cleaning, fine-tuning, and back-translation.
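To make the BLEU versus BLEUP distinction concrete, here is a rough sketch in Python. It is not the researchers' code: the `simple_bleu` function is a simplified, sentence-level stand-in for BLEU (real BLEU is computed over a whole test set, typically with n-grams up to length 4), and the example sentences are invented for illustration. The only difference between the two scores is which reference the output is compared against.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=2):
    """Simplified sentence-level BLEU (illustrative assumption, not the
    official metric): geometric mean of clipped n-gram precisions for
    n = 1..max_n, multiplied by a brevity penalty."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Penalize hypotheses that are shorter than the reference.
    if len(hyp) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


# Invented example sentences (hypothetical, not taken from the paper).
standard_ref = "the committee has approved the new budget"
paraphrased_ref = "the new budget got the green light from the committee"

literal_hyp = "the committee approved the new budget"
freer_hyp = "the new budget got the go-ahead from the committee"

for name, hyp in [("literal", literal_hyp), ("freer", freer_hyp)]:
    bleu = simple_bleu(hyp, standard_ref)      # standard-reference score
    bleu_p = simple_bleu(hyp, paraphrased_ref)  # paraphrased-reference score
    print(f"{name:8s} BLEU={bleu:.3f} BLEUP={bleu_p:.3f}")
```

In this toy setup, the literal output scores higher against the standard reference, while the freer output scores higher against the paraphrased one. The absolute numbers mean nothing, since the sentences were hand-picked; the sketch only shows how swapping the reference changes which kind of output an overlap-based metric rewards.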
Their findings showed that engines optimized for BLEUP made gains in adequacy and fluency when evaluated by humans, and produced noticeably less literal translations. Moreover, as BLEUP scores increased, standard BLEU scores tended to decrease.
“Paraphrased automatic evaluation therefore seems to be a promising proxy for human evaluation when making design choices for MT systems,” the researchers concluded.
There are still use cases where clients might purposely want a simpler, more direct translation, such as content intended for language learners, pharmaceutical translations, and dubbing or speech translation that needs to match a speaker’s actions as closely as possible.
That said, Freitag considers BLEUP a promising replacement for BLEU, and expects it would be helpful for any content containing longer, well-formed sentences, and across the board for high-resource language pairs.
“A lot of people are actually asking me, ‘Can you provide it for other language pairs?’” Freitag said. “Of course that would be nice, but costly. Even Google is very open to releasing the data and doing a lot of work, but it’s definitely not feasible.”
Ultimately, of course, the goal of industry-backed research is to integrate findings into consumer-facing products. Freitag confirmed that the research behind the paper was basically a test batch for Google’s production system.
“The next step is to incorporate everything into Google Translate and hopefully get a better translation experience,” Freitag said, adding that there are several improvements for high-resource languages in the pipeline. Rather than update the model every two or three months, Google wants to combine the updates and launch them together. “Hopefully, by the beginning of next year we’ll have something in the works.”
Outside of Google’s offerings, Freitag sees an opportunity for this research to spark a reevaluation of past design decisions that ended up favoring simple, monotonic translations.
“A lot of decisions we made in past years were actually driven by BLEU,” Freitag said. “I think this is an investment for the future and a lot of other researchers could benefit. It could actually show research done a few years ago, because of BLEU, could be better.”