Here Are the Top 10 Most Influential Research Papers on Neural Machine Translation

Roughly half a decade after neural machine translation (NMT) was first deployed by trailblazing language service providers (LSPs) and buyer organizations, it is time to look back at the most innovative research and shed light on how the industry established a new normal.

Slator ranked the most influential research dealing with NMT based on the number of times each paper was cited since publication, averaging citation counts as reported by Semantic Scholar and Google Scholar.

This list focuses exclusively on papers dealing with NMT. So, for example, although the June 2014 paper Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation was cited over 10,000 times, it is not included here because its subject is statistical, not neural, machine translation.

Note that the latest publication date on the list is 2017, which makes sense considering that, as time passes and the field evolves, it becomes more difficult for research to truly break new ground.

This list is also Slator’s way of bidding adieu to the “neural” in neural machine translation, as most of the industry now refers to NMT as simply MT.

#1 Neural Machine Translation by Jointly Learning to Align and Translate
Citations: ≈14,400
Date Published: September 2014
Authors: Dzmitry Bahdanau (Jacobs University Bremen, Germany), Kyunghyun Cho, Yoshua Bengio (Université de Montréal)

The first NMT models typically encoded a source sentence into a fixed-length vector, from which a decoder generated a translation. Bahdanau, Cho, and Bengio identified the fixed-length vector as a glass ceiling for translation quality, particularly for long sentences. The architecture of their proposed model, RNNsearch, focused “only on information relevant to the generation of the next target word.” The authors described the model’s performance as striking, “considering that the proposed architecture, or the whole family of neural machine translation, has only been proposed as recently as this year.”
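
In code, the idea boils down to a learned weighting over the encoder's annotations at every decoding step. Below is a minimal NumPy sketch of additive (Bahdanau-style) attention; the weight matrices and dimensions are illustrative placeholders, not the published RNNsearch configuration.

```python
import numpy as np

def additive_attention(decoder_state, encoder_states, W_dec, W_enc, v):
    """Additive (Bahdanau-style) attention with illustrative dimensions.

    decoder_state:  (d,)    previous decoder hidden state s_{t-1}
    encoder_states: (T, d)  annotations h_1..h_T of the source sentence
    Returns the context vector (weighted sum of annotations) and the weights.
    """
    # Alignment scores e_j = v^T tanh(W_dec s_{t-1} + W_enc h_j)
    scores = np.tanh(decoder_state @ W_dec.T + encoder_states @ W_enc.T) @ v
    # Softmax over source positions -> attention weights alpha_j
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector c_t = sum_j alpha_j h_j, used to predict the next target word
    return weights @ encoder_states, weights

# Toy usage: 5 source positions, hidden/attention size 8
rng = np.random.default_rng(0)
d, T = 8, 5
context, alpha = additive_attention(
    rng.normal(size=d), rng.normal(size=(T, d)),
    rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d))
print(alpha.round(3), context.shape)
```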

#2 Effective Approaches to Attention-based Neural Machine Translation
Citations: ≈4,490
Date Published: September 2015
Authors: Minh-Thang Luong, Hieu Pham, Christopher D. Manning (Stanford University)

Inspired by the integration of attentional mechanisms into NMT, which allow models to focus on select parts of the source sentence during translation, Luong, Pham, and Manning explored two potentially useful architectures for attention-based NMT: a global approach, which looks at all source words, and a local approach, which looks at only a subset of source words at a time. Applied to WMT translation tasks between English and German, both setups improved translation quality, with local attention yielding significant gains (as measured by BLEU) and the ensemble model establishing new state-of-the-art results for WMT14 and WMT15.
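
The difference between the two can be sketched in a few lines: global attention scores every source position, while local attention restricts scoring to a window around an alignment point (which the paper predicts from the decoder state; here it is simply passed in, and the paper's Gaussian weighting is omitted). Dot-product scoring and the window size are illustrative choices.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_attention(dec_state, enc_states):
    """Global attention: score every source position (dot-product scoring)."""
    return softmax(enc_states @ dec_state)

def local_attention(dec_state, enc_states, center, window=2):
    """Local attention: score only a window around the alignment point `center`."""
    T = len(enc_states)
    lo, hi = max(0, center - window), min(T, center + window + 1)
    weights = np.zeros(T)
    weights[lo:hi] = softmax(enc_states[lo:hi] @ dec_state)
    return weights

rng = np.random.default_rng(1)
enc_states, dec_state = rng.normal(size=(10, 4)), rng.normal(size=4)
print(global_attention(dec_state, enc_states).round(2))          # mass over all 10 positions
print(local_attention(dec_state, enc_states, center=6).round(2)) # mass only near position 6
```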

#3 Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Citations: ≈3,250
Date Published: September 2016
Lead Researchers: Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi (Google)

Google’s Neural Machine Translation System (GNMT), touted as producing translations “nearly indistinguishable” from human translations, was designed to scale NMT for work in the real world by decreasing training time, accelerating final translation speed, and improving the handling of rare words. A beam search technique promoted output sentences more likely to cover all the words in the source sentence, and “wordpiece” modeling accounted for morphologically rich languages. Human side-by-side evaluation of simple sentences showed a 60% reduction in translation errors compared to Google’s previous phrase-based production system. The authors concluded that details such as length normalization and coverage penalties “are essential to making NMT systems work well on real data.”
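
Those two beam-scoring terms can be written out compactly. The sketch below follows the length-normalization and coverage-penalty formulas described in the GNMT paper, with illustrative values for the two strength parameters.

```python
import numpy as np

def gnmt_score(log_prob, target_len, attention, alpha=0.6, beta=0.2):
    """Rescore a beam hypothesis with length normalization and a coverage penalty.

    log_prob:   summed log P(Y|X) of the hypothesis
    target_len: |Y|, number of target tokens produced so far
    attention:  (target_len, source_len) attention weights p_ij
    alpha/beta: strengths of the two terms (illustrative values)
    """
    # Length penalty lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha
    lp = (5.0 + target_len) ** alpha / (5.0 + 1.0) ** alpha
    # Coverage penalty rewards hypotheses whose attention covers every source word
    coverage = np.minimum(attention.sum(axis=0), 1.0)
    cp = beta * np.log(coverage + 1e-9).sum()   # small epsilon guards against log(0)
    return log_prob / lp + cp

# Toy hypothesis: 3 target tokens attending over 3 source words
att = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.8, 0.1],
                [0.1, 0.1, 0.8]])
print(gnmt_score(log_prob=-4.2, target_len=3, attention=att))
```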

#4 On the Properties of Neural Machine Translation: Encoder-Decoder Approaches
Citations: ≈3,035
Date Published: September 2014
Authors: Kyunghyun Cho, Bart van Merrienboer, Yoshua Bengio (Université de Montréal), Dzmitry Bahdanau (Jacobs University Bremen, Germany)

Researchers compared two NMT models with different kinds of encoders: one, an RNN with gated hidden units, and the other, a gated recursive convolutional neural network (grConv). Although both models were able to produce correct translations of short sentences without unknown words, the quality suffered as sentences grew longer and as more unknown words were included. “It is important to find a way to scale up training a neural network both in terms of computation and memory so that much larger vocabularies for both source and target languages can be used,” the authors wrote, adding that a “radically different approach” might be required for languages with rich morphology.

#5 Neural Machine Translation of Rare Words With Subword Units
Citations: ≈2,960
Date Published: August 2015
Authors: Rico Sennrich, Barry Haddow, Alexandra Birch (University of Edinburgh)

Back in 2015, NMT models would “back off” to a dictionary upon encountering rare or unknown words. Sennrich, Haddow, and Birch, however, believed there was a way that NMT systems could handle translation as an “open-vocabulary problem.” If various word classes, such as names, cognates, and loan words, were “translatable via smaller units than words,” then encoding such rare and unknown words as “sequences of subword units” could help an NMT system handle them. The researchers looked at several word segmentation techniques, and their subword models showed improvement over a “back-off dictionary baseline” for the WMT15 English-German and English-Russian translation tasks.
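
The best known of those techniques is byte pair encoding (BPE). The snippet below is a lightly commented version of the toy merge-learning loop published with the paper: words are split into characters, and the most frequent adjacent symbol pair is merged repeatedly until the desired number of merge operations is reached.

```python
import re
import collections

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs across the vocabulary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge the chosen symbol pair into a single symbol everywhere it occurs."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: words split into characters, with an end-of-word marker </w>
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

for _ in range(10):              # the number of merges is the only hyperparameter
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print('merged', best)
```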

#6 OpenNMT: Open-Source Toolkit for Neural Machine Translation
Citations: ≈1,050
Date Published: January 2017
Authors: Guillaume Klein, Jean Senellart (Systran), Yoon Kim, Yuntian Deng, Alexander M. Rush (Harvard University)

What makes a helpful, open-source toolkit for NMT? For the Systran and Harvard University researchers behind OpenNMT, it was “modeling and translation support, as well as detailed pedagogical documentation about the underlying techniques.” OpenNMT was designed to prioritize efficiency and modularity, with the goal of supporting NMT research into model architectures, feature representations, and source modalities, and providing a stable framework for production use. At the same time, OpenNMT was also meant to maintain competitive performance and reasonable training requirements. 

#7 Improving Neural Machine Translation Models With Monolingual Data
Citations: ≈1,015
Date Published: June 2016
Authors: Rico Sennrich, Barry Haddow, Alexandra Birch (University of Edinburgh)

Target-side monolingual data was already known to boost the fluency of phrase-based statistical MT, but this paper demonstrated that it could also be an asset to NMT. The researchers paired target-language monolingual sentences with automatic back-translations into the source language and treated the resulting synthetic pairs as additional training data. Since the monolingual data could be integrated without changing the neural network architecture, the authors believed their approach held promise for different types of NMT systems, but acknowledged that its effectiveness would ultimately depend on the quality of the NMT system used for back-translation and on the amounts of available parallel and monolingual data.
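
The recipe itself is simple enough to sketch in a few lines. The `translate_tgt_to_src` function below is a hypothetical stand-in for an existing target-to-source NMT model; everything else is data bookkeeping.

```python
def back_translate_augment(parallel_data, target_monolingual, translate_tgt_to_src):
    """Build synthetic parallel data from target-side monolingual sentences.

    parallel_data:         list of (source, target) sentence pairs
    target_monolingual:    list of target-language sentences
    translate_tgt_to_src:  hypothetical stand-in for a trained target->source NMT model
    """
    # Back-translate each monolingual sentence to obtain a synthetic source side
    synthetic = [(translate_tgt_to_src(tgt), tgt) for tgt in target_monolingual]
    # Mix synthetic pairs with real ones; the NMT architecture itself is unchanged,
    # which is the main appeal of the approach
    return parallel_data + synthetic
```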

#8 Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
Citations: ≈950
Date Published: November 2016
Lead Researchers: Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat (Google)

Google’s “simple” solution to multilingual NMT quickly turned into something much bigger. Researchers enabled multilingual NMT with a single model by introducing an artificial token at the beginning of each input sentence to specify the required target language; the remaining parameters were unchanged and shared across all languages. Their largest models included up to 12 language pairs and allowed for better translation of many individual pairs. What the researchers did not expect, however, was for the models to learn to bridge between language pairs never seen explicitly during training, demonstrating for “the first time to our knowledge” that transfer learning and zero-shot translation were indeed possible for NMT.
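
The entire mechanism is a single artificial token prepended to the source text; the token spelling below (e.g. `<2es>`) is illustrative, not necessarily the exact format used internally at Google.

```python
def add_target_token(source_sentence, target_lang):
    """Prepend an artificial token telling the shared model which language to emit."""
    return f"<2{target_lang}> {source_sentence}"

# One model trained on, say, English<->Spanish and English<->Portuguese can then be
# asked for a direction it never saw explicitly (zero-shot Portuguese->Spanish):
print(add_target_token("Hello, how are you?", "es"))  # -> <2es> Hello, how are you?
print(add_target_token("Olá, como vai?", "es"))       # -> <2es> Olá, como vai?
```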

#9 On Using Very Large Target Vocabulary for Neural Machine Translation
Citations: ≈785
Date Published: December 2014
Authors: Sébastien Jean, Kyunghyun Cho, Roland Memisevic, Yoshua Bengio (Université de Montréal)

For all of NMT’s gains over statistical MT, a large target vocabulary still posed a challenge: as the number of target words grew, so did the training and decoding complexity. To make use of a very large target vocabulary without increasing training complexity, a team of Montreal-based researchers proposed a new method, based on importance sampling, in which training and decoding focus on only a small subset of the whole target vocabulary at a time. Models trained this way matched, and sometimes outperformed, baseline models with a small vocabulary.
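
The core trick, stripped of the importance-sampling machinery, is to normalize the softmax over a shortlist of candidate words rather than the full target vocabulary. The NumPy sketch below is a much-simplified illustration of that idea, not the paper's exact estimator.

```python
import numpy as np

def subset_softmax_loss(logits, target_id, candidate_ids):
    """Negative log-likelihood normalized over a vocabulary shortlist only.

    logits:        (V,) scores over the full target vocabulary for one prediction
    target_id:     index of the correct target word (must be in candidate_ids)
    candidate_ids: shortlist indices (target words of the batch plus sampled negatives)
    """
    cand = logits[candidate_ids]
    # Log of the normalization constant, computed over the shortlist instead of all V words
    log_z = np.log(np.exp(cand - cand.max()).sum()) + cand.max()
    return -(logits[target_id] - log_z)

rng = np.random.default_rng(2)
V = 500_000                                   # a very large target vocabulary
logits = rng.normal(size=V)
target = 123
candidates = np.unique(np.concatenate([[target], rng.integers(0, V, size=500)]))
print(subset_softmax_loss(logits, target, candidates))
```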

#10 Modeling Coverage for Neural Machine Translation
Citations: ≈520
Published: January 2016
Authors: Zhaopeng Tu, Zhengdong Lu, Xiaohua Liu, Hang Li (Huawei Technologies, Hong Kong), Yang Liu (Tsinghua University, Beijing)

The attention mechanism, credited with boosting state-of-the-art NMT by jointly learning to align and translate, is a bit of a double-edged sword: because it ignores past alignment information, it can translate some source words repeatedly while leaving others untranslated (over- and under-translation). Feeding a coverage vector to the attention model, helping it focus on untranslated words, can mitigate these issues. The two models proposed and explored in this paper, linguistic coverage (which leverages more linguistic information) and NN-based coverage (which resorts to the flexibility of neural network approximation), both achieved “significant improvements in terms of translation quality and alignment quality over NMT without coverage.”
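
A rough sketch of the idea: the coverage vector accumulates the attention mass already spent on each source word and is fed back into the scorer, nudging attention toward words that have not yet been covered. The subtraction-based fusion below is a simplification of the models described in the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_with_coverage(dec_state, enc_states, coverage, penalty=1.0):
    """One decoding step of coverage-aware attention (simplified fusion).

    `coverage` accumulates the attention mass already spent on each source word;
    subtracting it from the scores pushes attention toward untranslated words.
    """
    scores = enc_states @ dec_state - penalty * coverage
    weights = softmax(scores)
    coverage = coverage + weights            # update the running coverage vector
    context = weights @ enc_states
    return context, weights, coverage

rng = np.random.default_rng(3)
enc_states = rng.normal(size=(6, 4))
coverage = np.zeros(6)
for step in range(3):
    _, weights, coverage = attend_with_coverage(rng.normal(size=4), enc_states, coverage)
    print(step, weights.round(2))
```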

Finally, read this Twitter thread for a short discussion on the merits of using citation count as a measure of influence.