The Facebook AI Research (FAIR) team published a new paper on neural machine translation (NMT) to arXiv.org on February 28. FAIR team members conducted what they claim is a first-of-its-kind analysis of the inner workings of NMT models.
Specifically, they wanted to understand “uncertainty” in NMT, and propose ways to reduce its negative impact on output.
NMT research is picking up again on arXiv, Cornell University’s automated online distribution system for research, after a temporary slump during December 2017 and January 2018. Indeed, from November 1, 2017 to February 14, 2018, 46 research papers on NMT were published. Four of those also focused on the inner workings of NMT.
“There is still a lack of understanding of these models,” FAIR’s paper abstract reads. “Our study relates some of these issues to the inherent uncertainty of the task.”
In particular, FAIR researchers point out that NMT contends with two kinds of uncertainty when performing translations: intrinsic and extrinsic uncertainty.
First, the act of translation itself has intrinsic uncertainty in that a source sentence can be translated into a variety of different target sentences, all of which would mean the same thing while being equally adequate and fluent.
Another intrinsic uncertainty in translation arises from lack of context, be it grammatical or cultural. “Without additional context, it is often impossible to predict the missing gender, tense, or number, and therefore, there are multiple plausible translations of the same source sentence,” the research paper reads.
Aside from these intrinsic uncertainties, NMT also needs to deal with extrinsic uncertainties often pertaining to noise in the training data.
In their research paper, the FAIR team pointed out a few sources of extrinsic uncertainty:
- Augmenting high-quality, human-translated corpora with “lower quality web crawled data”
- Partial translations in the corpora, and
- Target sentences that are exact copies of their source sentences rather than actual translations, which the researchers called source copying. They observed this at least in the datasets they used: the English-to-German and English-to-French data from the 2014 Workshop on Statistical Machine Translation (WMT 2014).
According to the FAIR team, source copying, in particular, was “interesting since we show that, even in small quantities, it can significantly affect the model output.”
The researchers put forward two methods to mitigate these extrinsic uncertainties:
- Remove low-scoring sentence pairs according to a model trained on relevant corpora. The FAIR team used the English-to-German news-commentary portion of the Conference on Machine Translation 2017 (WMT 2017).
- Eliminate the small but high-impact occurrences of source copying. The FAIR team used an automated algorithm that prunes parallel sentences with 50% overlap (indicating a high likelihood of a partial or full copy).
The FAIR team also noted that “performance degradation is greatly mitigated” when both methods are applied at the same time.
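To make the two filtering steps concrete, here is a minimal Python sketch of how they might be combined. It is an illustration under stated assumptions, not the FAIR team's released code: the article does not say how overlap is measured, so the sketch assumes Jaccard overlap between lowercased, whitespace-split tokens. The function names, the `score_fn` callable and the `min_score` cutoff are all hypothetical.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def token_overlap(source: str, target: str) -> float:
    """Jaccard overlap of lowercased whitespace tokens (an assumed measure)."""
    src = set(source.lower().split())
    tgt = set(target.lower().split())
    if not src or not tgt:
        return 0.0
    return len(src & tgt) / len(src | tgt)


def drop_source_copies(pairs: List[Pair], max_overlap: float = 0.5) -> List[Pair]:
    """Keep only pairs whose source/target overlap is below max_overlap."""
    return [(s, t) for s, t in pairs if token_overlap(s, t) < max_overlap]


def filter_pairs(
    pairs: List[Pair],
    score_fn: Callable[[str, str], float],  # hypothetical: score from a trusted model
    min_score: float,                        # hypothetical cutoff, not from the paper
    max_overlap: float = 0.5,
) -> List[Pair]:
    """Apply both filters: drop low-scoring pairs, then drop likely copies."""
    scored = [(s, t) for s, t in pairs if score_fn(s, t) >= min_score]
    return drop_source_copies(scored, max_overlap)


corpus = [
    ("the cat sat on the mat", "die Katze saß auf der Matte"),
    ("the cat sat on the mat", "the cat sat on the mat"),  # full source copy
    ("New York is big", "New York ist groß"),              # shared proper noun only
]
cleaned = drop_source_copies(corpus)
```

In this toy corpus only the second pair is removed. The third pair shares the tokens "New York" but stays below the 50% threshold, which shows why a threshold is used rather than removing every pair with shared tokens.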
The researchers have open sourced the code to reproduce their analysis, and also released the data collected from their evaluation.
It remains to be seen how Facebook will directly benefit from this research, and there has been no follow-up research or known application of the FAIR team’s previous research on post-editing for NMT using “very simple interactions.”