Claims that machine translation had achieved near-human parity (however it is defined) back in 2016 were met with disbelief. Critics were quick to point out that the technology was still far from matching the quality of human translators, and that it was the metrics that were flawed.
Love it or hate it, neural machine translation (NMT) became widely adopted across the language industry in the years that followed. It has fundamentally changed the supply chain and disrupted the way humans interact with translation technology, generating significant productivity gains for users.
NMT now underpins parts of the translation workflow, but relatively little is known about how the machine actually understands content or generates output, and why some residual quality issues persist.
Two researchers have now shone a light on some of the oddities found in NMT output, exploring unexpected behavior in RNN and Transformer NMT models. In a paper published on pre-print platform arXiv on May 25, 2020, Marzieh Fadaee and Christof Monz from the University of Amsterdam looked into “The Unreasonable Volatility of Neural Machine Translation Models.”
RNNs (Recurrent Neural Networks) are a type of artificial neural network, while the Transformer is a deep learning model that was introduced by Google researchers in 2017. The latter is the newer and now more prevalent architecture used in machine translation and speech processing.
Fadaee was a PhD candidate at the university and has since become an NLP / ML Research Engineer at deep learning R&D lab Zeta Alpha Vector. Monz, who remains Associate Professor, describes his research interests as covering “information retrieval, document summarization and machine translation” on his LinkedIn page.
The basis for their research is that, although NMT performs well, it is not generally understood how the models behave. Examining the unexpected behavior of NMT could reveal more about its capabilities as well as shortcomings.
During their research, Fadaee and Monz observed that minor changes to the source sentences sometimes resulted in an “unexpected change in the translation,” which in some cases constituted a translation error. Since the models behaved inconsistently when confronted with similar source sentences, they are considered “volatile,” the two explained.
The researchers performed a series of tests to analyze the translations of the modified source sentences and the types of changes that occurred.
It is important to note that all source sentences, including modified ones, were semantically correct and plausible for the purposes of their experiments. The changes the researchers made to source sentences were minor and limited to the following: removing adverbs, changing numbers (increasing them by at most five), and inserting common words. They also changed gender pronouns, having been inspired by prior work on gender bias.
One test applied only to changes to numbers in source sentences. For this category of change, it was possible to have multiple variations of the original source sentence (e.g., +1, +2, +3, +4 and +5). Logically, the translations of the changed sentences should differ only to account for the change in number, but the researchers found examples of “unexpectedly large oscillations” for both models.
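To make the number test concrete, here is a minimal Python sketch of how such variations could be generated. It is an illustration only, not the researchers' code: the function name `number_variants`, the choice to modify only the first integer in the sentence, and the example sentence are all assumptions.

```python
import re


def number_variants(sentence, max_delta=5):
    """Return copies of `sentence` with its first integer raised by 1..max_delta.

    Hypothetical helper for illustration; the paper's actual procedure
    for selecting and changing numbers may differ.
    """
    match = re.search(r"\d+", sentence)
    if match is None:
        return []  # nothing to modify
    value = int(match.group())
    prefix = sentence[:match.start()]
    suffix = sentence[match.end():]
    return [prefix + str(value + delta) + suffix
            for delta in range(1, max_delta + 1)]


if __name__ == "__main__":
    for variant in number_variants("She bought 3 apples."):
        print(variant)
```

Each variant would then be translated by the same model, with the expectation that only the number changes in the output.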
They also looked at deviations from the original translation and classified them as major or minor deviations. The results showed major differences in 18% of RNN translations and 13% of Transformer translations.
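The article does not say how the researchers measured deviation or where they drew the line between major and minor. As a rough sketch of the general idea, the Python snippet below compares two translations token by token using the standard library's `difflib.SequenceMatcher`. The function names and the 0.5 threshold are assumptions chosen for illustration, not values from the paper.

```python
from difflib import SequenceMatcher


def deviation(original, variant):
    """Token-level dissimilarity between two translations, from 0.0 to 1.0.

    Uses whitespace tokenization and difflib's similarity ratio; this is
    a stand-in measure, not the one used in the paper.
    """
    original_tokens = original.split()
    variant_tokens = variant.split()
    return 1.0 - SequenceMatcher(None, original_tokens, variant_tokens).ratio()


def classify(original, variant, threshold=0.5):
    """Label a deviation 'major' or 'minor' using an assumed threshold."""
    return "major" if deviation(original, variant) > threshold else "minor"


if __name__ == "__main__":
    print(classify("He bought 3 apples", "He bought 4 apples"))
    print(classify("He bought 3 apples", "Four apples were purchased by him"))
```

A translation that differs only in the changed number scores as a small deviation, while one that is paraphrased or reordered scores much higher, which is the kind of contrast the researchers were looking for.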
Most of the deviations (ca. 70%) were “as expected,” meaning that they were justified by the change to the original source sentence, while unexpected changes included different verb tenses, reordered phrases, paraphrasing, preposition changes, and more. “The vast majority of changes are due to paraphrasing and dropping of words,” the researchers found. Unexpected changes did not necessarily impact translation quality.
Translation quality was tested separately through a manual evaluation by human annotators. Overall, 26% of changes observed for the RNN model impacted translation quality, compared to 19% of those observed for the Transformer model.
In conclusion, the researchers said, “even with trivial linguistic modifications of source sentences, we can effectively identify a surprising number of cases where the translations of extremely similar sentences are surprisingly different.” This means that NMT models are vulnerable to the slightest change in the source sentence, which points to two other potential shortcomings: generalization and compositionality.
Generalization refers to an MT system being able to translate long source sentences that it has not previously encountered. Compositionality is where an MT system combines multiple, simple sentence parts to build a longer, more complex string.
In their view, “the volatile behavior of the MT systems in this paper is a side effect of the current models not being compositional” because the systems clearly do not demonstrate a good understanding of the underlying sentence parts — if they did, they would not generate the inconsistencies observed.
Moreover, Fadaee and Monz said, while NMT models are capable of generalization, they do so without compositionality. As such, the researchers argued that NMT models “lack robustness” and hoped that their “insights will be useful for developing more robust NMT models.”