Hyperbolic? Experts Weigh In on Google Neural Translate

On September 27, 2016, we broke the story that Google had just published a research paper claiming a breakthrough in neural machine translation. In a blog post that went live after we published our article, Google announced that the new Google Neural Machine Translation (GNMT) system is now powering “100% of machine translations from Chinese to English—about 18 million translations per day.” More language combinations will follow.

This represents significant progress. It is the first time NMT has run in a production deployment at a scale and speed that actually makes it useful to the online public. Google achieves the speed by training the model on GPUs typically used in gaming and then, in a novel move, executing it on its Tensor Processing Units (TPUs), computer chips custom-built for artificial intelligence (AI) applications.

The Google team was clearly excited by the new model’s output. The paper’s title, Bridging the Gap Between Human and Machine Translation, is anything but modest. It claimed, “Our system’s translation quality approaches or surpasses all currently published results,” and, furthermore, that “human and Google Neural Machine Translations are nearly indistinguishable.”

Yes, those quotes were plucked out in isolation and lack the research paper’s 20 pages of context. And granted, Google tempered expectations by pointing to the “relatively simplistic and isolated sentences sampled…for the experiment.”

They also blogged that “machine translation is by no means solved,” and admitted “GNMT can still make significant errors that a human translator would never make.”

Still, the release was a clever example of what happens at the intersection of research and marketing—to quote Iconic Translation Machines CEO John Tinsley—and kicked off a major news cycle, with the story being picked up by all the major tech blogs and even Science Magazine.

For technology savvy language service providers and translators, it creates the premise of significant growth, driven by increased translation and post-editing volumes and smarter assistive tools—Daniel Marcu

We went beyond the hype and reached out to a dozen leading researchers and practitioners in the field of machine translation, as well as Google’s Mike Schuster, one of the paper’s lead authors, to understand just how big a deal this is. The following experts responded to Slator’s request for comments:

Diego Bartolome, CEO of MT specialist tauyou language technology
Kirti Vashee, an independent technology and marketing consultant, formerly of Asia Online
Tony O’Dowd, CEO of KantanMT
Daniel Marcu, Founder of FairTradeTranslation.com and former SDL Chief Science Officer
John Tinsley, CEO and co-Founder of Iconic Translation Machines
Joss Moorkens, post-doctoral researcher at ADAPT Center and Lecturer at Dublin City University (DCU)
Rico Sennrich, post-doctoral researcher in the Machine Translation group at the University of Edinburgh
Juan Alonso, Head of Lucy Software MT development
Gábor Bessenyei, Managing Director at MorphoLogic Localisation
Jean Senellart, CTO of SYSTRAN
Abdessamad Echihabi, VP of Research & Development at SDL
META-NET’s Jan Hajic, Josef van Genabith, Andrejs Vasiļjevs, Georg Rehm
And Mike Schuster, Research Scientist at Google

A Breakthrough?

Describing the Google announcement as “good news for everyone,” Daniel Marcu of FairTradeTranslation.com says, “For researchers, it validates that the neural trend many have embraced in the last two years is worth pursuing. For technology savvy language service providers and translators, it creates the premise of significant growth, driven by increased translation and post-editing volumes and smarter assistive tools. For teams interested in developing their own MT systems, it provides the blueprint for an easier to replicate model—a neural engine has less hidden, non-reported ‘black magic’ than a phrase-based statistical engine.”

Marcu adds that while “training and deployment costs are still a barrier for many organizations, those are rapidly declining as well.”

The only approach that is going to give significant improvements over current capabilities into the future—John Tinsley

John Tinsley of Iconic says the availability of neural MT via the Internet “has substantial positive implications for Google Translate and similar online consumer applications in the short term.”

Tinsley calls NMT “the way forward” and “the only approach that is going to give significant improvements over current capabilities into the future, for all providers.” He qualifies, however, that from a commercial and enterprise perspective, domain- or user-adaptation of MT engines “based on existing statistical and ensemble approaches” remains state of the art.

He says his team has been collaborating with the ADAPT research center on “an apples-to-apples comparison” of NMT capability against Iconic’s production engines, “the results of which will be shared in the next month. The NMT field is active!”

Rico Sennrich of University of Edinburgh is confident in Google’s claim that their NMT system substantially surpasses their phrase-based production system, and notes “this will have a wide impact on the quality” of Google’s translation service. He adds, “Neural machine translation is an exciting new technology that will have a significant effect on the field of machine translation.”

Diego Bartolome of tauyou concurs, saying Google’s research is “not marketing but a very well structured and justified piece. So I guess it will boost investment in NMT.” He says his team has been working on NMT since the start of 2016 and, “for the domain-specific engines we usually work with, we haven’t achieved a quality increase that would justify the investment yet. [We are] using NMT now to further improve SMT output, together with linguistic rules.”

The potential of NMT surpassing the translation quality of PBMT is very real—Tony O’Dowd

KantanMT’s Tony O’Dowd, meanwhile, calls the Google research paper “a pivotal piece” and says it “complements similar findings from SYSTRAN earlier this month; namely that the potential of NMT surpassing the translation quality of PBMT is very real.”

According to O’Dowd, “Google goes beyond the research from SYSTRAN” by presenting “comparative and empirical analysis to demonstrate the improvements of NMT” and making the bold statement that, in some cases, human translation is indistinguishable from NMT. He warns that “those MT vendors that ignore this field of research will be left behind and perish.”

He points out that NMT still faces several challenges, not the least of which is “the computational overhead in training models, as even modestly sized models can take days, even weeks to train.” O’Dowd adds that “gaps in domain vocabulary, which lead to unknown words, can produce unpredictable and strange translations.”
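The Google paper addresses exactly this unknown-word problem by splitting rare words into smaller “wordpiece” units drawn from a fixed subword vocabulary, so the system rarely has to fall back on a catch-all unknown token. The following is a minimal sketch of the idea, using a greedy longest-match segmenter in the style of later wordpiece tokenizers; the toy vocabulary and matching rule here are invented for illustration and are not Google’s actual implementation.

```python
# Toy wordpiece segmentation: rare words are decomposed into known subword
# fragments instead of being replaced wholesale by an <unk> token.
# The vocabulary below is invented; "##" marks a word-internal fragment.
VOCAB = {"trans", "##lat", "##ion", "##s", "un", "##known"}

def wordpiece(word):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:  # take the longest fragment that is in-vocabulary
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:  # no fragment matched at all: fall back to the unknown token
            return ["<unk>"]
    return pieces

print(wordpiece("translations"))  # ['trans', '##lat', '##ion', '##s']
print(wordpiece("unknown"))       # ['un', '##known']
```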

O’Dowd shared that KantanLabs has its own project, called Prometheus, headed by Dr. Dimitar Shterionov and Dr. Santanu Pal, “which is currently experimenting with model training on high-speed, GPU-based machines.” He says they will publish their results shortly.

Exciting Times

Jean Senellart of SYSTRAN calls Google’s announcement “great news” and says it “confirms that we are living very exciting times in machine translation history.”

Providing an update on their activities, he says: “As announced at the end of August, we launched our NMT engine called PNMT for Pure Neural Machine Translation, providing translation quality superior to the current state of the art, and in some ways, even to human translation. While research efforts are ongoing to improve the core technology, we are ready to industrialize and customize this technology for in-domain translation. Within two weeks, the first release will be available and a beta test program will begin.”

We are living very exciting times in machine translation history—Jean Senellart

Joss Moorkens of DCU says their experience with NMT shows that, “based on an SMT/NMT comparative evaluation that we are currently carrying out for TraMOOC using educational texts…NMT results in far fewer word-order errors in the target text (i.e., for the EN-DE and EN-PT data we’ve looked at so far) as reported in the recent paper by Bentivogli et al.; but we still see lexical errors, and the qualitative jump is not consistent across all languages (e.g., for EN-RU).”

According to Moorkens, “In a ranking exercise for the language pairs above plus EN-EL, we have seen a marked preference for NMT output over SMT across several educational text types and across varying segment lengths; but this is not necessarily reflected in comparative post-editing times.”

He notes that the EN-FR and EN-ES language pairs mentioned in the Google paper “are those considered best-supported by MT in the META-NET white paper reports, so an improvement will make GMT for gisting more useful; and that’s the main impact I expect to see from their move to NMT.”

Juan Alonso of Lucy says that “if the results in the paper are true—and there is no reason to doubt them—the MT quality achieved is really impressive. Of course, we are still talking about a paper that presents preliminary results based on a set of 500 sentences for a limited set of language pairs for which large volumes of training data exist, and which have a relatively weak morphology, such as Chinese and English.”

“The comment on the impressive quality, therefore, only refers to what is claimed in the paper and is not a general comment on the potential true quality of the MT system,” says Alonso.

He adds, “Looking broadly at the acceptance of MT in the real world, the main current complaints from MT users are: no terminology consistency, no deterministic output, word omissions, over-generation, and no on-site solution (i.e., due to high hardware requirements). The key question is: Will this new approach address and overcome these types of issues? This remains to be seen with real-life texts.”

We are still a long way from “nearly indistinguishable from human translation”—Kirti Vashee

Independent consultant Kirti Vashee calls Google’s research “interesting,” but says “we are still a long way from ‘nearly indistinguishable from human translation.’”

Vashee cautions against “overstating the definite and clear progress that has been made.” He says that, “for some reason, this overstatement of progress is something that happens over and over again in MT. Keep in mind that drawing conclusions on a sample of 500 is risky even when the sample is really well chosen and the experiment has an impeccable protocol.”

However, he says that the study “has just raised the bar for the Moses DIY practitioners, which makes even less sense now since you could do better with generic Google or Microsoft, who also have several NMT initiatives underway.”

While Gábor Bessenyei of MorphoLogic says they currently cannot comment in detail as they do not “have sufficient information” on the matter, he points out that “NMT in general is, of course, one of the most promising directions and, for sure, the future. And with the language and hardware resources of Google, it is likely that the results are as promising as claimed.”

For some reason, this overstatement of progress is something that happens over and over again in MT—Kirti Vashee

Abdessamad Echihabi of SDL says, “It is always exciting to see the continuous progress in MT and we welcome any developments in this area. Much like Google, SDL has been actively researching and investing in Neural MT. However, as Google correctly points out, ‘Machine translation is by no means solved’ with Neural MT. One of the main challenges for MT is how to integrate it in solutions that solve real customer problems.”

Echihabi touches on a key point here. A quality jump in Google Translate will delight the internet masses, yet highly specialized MT vendors will continue to develop products tailored for enterprise and professional translator use.

Asked about SYSTRAN’s own impending NMT launch in early October, CTO Senellart told Slator in a call in early September 2016 that the new technology only changes the core and has no direct impact on product and distribution.

META-NET also weighed in, noting how neural approaches have revolutionized research in machine translation. They say, “This summer, at WMT 2016, neural research systems from the EU-funded research project QT21 outperformed current online MT systems in most of the competitions. It is great to see cutting-edge research developments being reflected in the new incarnations of the online and commercial systems that have contributed so much to bringing automatic translation into our daily lives.”

This is making a strong contribution to overcoming language barriers within our European and global multilingual information societies—META-NET

The group added, “Google’s announcement is timely. We look forward to Google expanding neural MT to smaller, under-resourced, other morphologically rich, syntactically varied and challenging languages. It will be interesting to see the effects of the new approach in the translation and other relevant industries. This is making a strong contribution to overcoming language barriers within our European and global multilingual information societies.”

Where Research and Marketing Meet

DCU’s Moorkens concedes the Google paper shows impressive improvements in processing time, and that, when deployed, NMT should improve the quality of MT for gisting available via Google Translate. However, while he agrees developments in NMT are exciting, he says he would have preferred the less hyperbolic title of “Narrowing the Gap Between Human and Machine Translation.”

We asked Google’s Mike Schuster to reply to Moorkens’s take, as well as to comments by experts that the claim that GNMT “approaches the accuracy achieved by average bilingual human translators” is equally hyperbolic.

Schuster responds, “The complete sentence in our paper is, ‘Using human-rated side-by-side comparison as a metric, we show that our GNMT system approaches the accuracy achieved by average bilingual human translators on some of our test sets,’ which is accurate. Figure 6 shows that the side-by-side score distributions of GNMT and Human are roughly comparable.”

We need to be careful to avoid hyperbole!—John Tinsley

Iconic’s Tinsley lauds Google for assembling “a top team of researchers, who are doing good research in the area” of NMT, but admonishes everyone not to get ahead of themselves. He says, “This latest report describes incremental improvements, which is a fine accomplishment using some novel approaches; but it’s not groundbreaking. Claims of ‘near-human quality’ should be taken in context—these results are based on the translation of 500 simple sentences from Wikipedia, by their own admission. So we need to be careful to avoid hyperbole!”

Rico Sennrich, on the other hand, says the study performed human evaluation on “relatively simplistic and isolated” sentences, and that the claim that Google’s systems “bridge the gap between human and machine translation” simply “raises wrong expectations.” Sennrich echoes Google’s own admission that machine translation is still far from solved, adding that the quality of machine translation still “does not reach human translation quality on most text types.”

Independent consultant Vashee, meanwhile, is “very skeptical about the ‘evaluation’” because “humans are not good at comparing more than two sentences side by side.” He describes the announcement as being “more like corporate marketing manipulation of some at least mildly suspect data. This, to me, is a corporate marketing trick to fool gullible Internet minds who don’t understand statistics and test methodology.”

The quality of machine translation still “does not reach human translation quality on most text types”—Rico Sennrich

Google Responds

Vashee says, “If you look at the BLEU scores more closely and then look again at the table where humans have rated three or four translations of the same thing side by side—a notoriously unreliable practice, by the way—you will see that Google has taken 10% improvements in BLEU and then publicized it as ‘55% to 85% improvements’ in error reduction.”

Google’s Schuster pushes back, “As you can see in the last section of our paper, human evaluation and translation of given sentences is ambiguous. We show the error reduction of side-by-side scores between our new system (GNMT) against humans vs. the old system (PBMT) against humans. Side-by-side scores and BLEU scores are not the same.”
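The arithmetic behind those headline percentages is simple to reproduce. Below is a minimal sketch, with invented numbers, of the relative “error reduction” Schuster describes: it measures how much of the old system’s gap to human side-by-side ratings the new system closes, which is why a modest absolute gain can legitimately become a large percentage.

```python
# Hypothetical illustration of relative "error reduction" computed from
# side-by-side scores (not BLEU). All numbers are invented for the example.

def relative_error_reduction(pbmt, gnmt, human):
    """Fraction of PBMT's gap to human ratings that GNMT closes."""
    old_gap = human - pbmt  # remaining shortfall of the old system
    new_gap = human - gnmt  # remaining shortfall of the new system
    return (old_gap - new_gap) / old_gap

# Made-up side-by-side ratings on the paper's 0-6 scale.
pbmt, gnmt, human = 3.6, 4.6, 4.8
print(f"{relative_error_reduction(pbmt, gnmt, human):.0%}")  # -> 83%
# A one-point absolute gain reads as an "83% error reduction" because the
# denominator is the (small) remaining gap to human quality.
```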

If BLEU scores are compared they have to be compared on a benchmark database with given training and test data—Mike Schuster

Vashee adds that, although “NMT is definitely proving to be a way to drive MT quality upward and forward,” it is, for now, “limited to those with deep expertise and access to huge processing and data resources. Experimental results like these should be interpreted with care, especially if they are BLEU-score based. Really good MT always looks like human translation. We should save our ‘nearly indistinguishable’ comments for when we get closer to 90% or at least 70% on this.”

Vashee concludes that Google’s results are consistent with what SYSTRAN reported but are “actually slightly less compelling at a BLEU-score level than the results SYSTRAN had.”

To which Google’s Schuster replies, “It seems there are no results in that article that can directly be compared against our results. If BLEU scores are compared they have to be compared on a benchmark database with given training and test data, otherwise it does not make sense to compare results.”
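Schuster’s objection is easy to illustrate: corpus-level BLEU is a function of a system’s outputs and one specific set of reference translations, so two scores are only comparable when computed on the same test set. Here is a minimal sketch using NLTK’s corpus_bleu, with toy sentences invented for the example:

```python
# Scoring two hypothetical systems on the SAME shared test set with corpus
# BLEU. Scores computed on different test sets are not comparable.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One shared test set: each hypothesis gets a list of reference translations
# (tokenized); corpus_bleu allows multiple references per sentence.
references = [
    [["the", "cat", "sat", "on", "the", "mat"]],
    [["machine", "translation", "is", "by", "no", "means", "solved"]],
]
system_a = [
    ["the", "cat", "sat", "on", "a", "mat"],
    ["machine", "translation", "is", "not", "solved"],
]
system_b = [
    ["a", "cat", "is", "on", "the", "mat"],
    ["machine", "translation", "is", "by", "no", "means", "solved"],
]

smooth = SmoothingFunction().method1  # avoid zero n-gram counts on a tiny corpus
for name, hyps in (("system A", system_a), ("system B", system_b)):
    print(name, round(corpus_bleu(references, hyps, smoothing_function=smooth), 3))
```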

Iconic’s Tinsley says Google’s statement that “systematic comparison with large scale, production quality, phrase-based translation systems has been lacking” is “not quite true.” He adds that it is “not the first work of its kind at this scale. The WMT shared task earlier this year had a comprehensive comparison of machine translation approaches, including NMT, from a variety of providers and research institutions across multiple different languages.”

Schuster answers this by saying: “We show results on WMT (a research database of modest size, ~4.5–36M sentence pairs) and results on our magnitudes larger production databases. We are not aware of other comparisons on that scale. If there are any, please provide a link to the paper.”

Our paper describes the new Google Neural Machine Translation system and obviously builds on many years, even decades of research by other groups—Mike Schuster

University of Edinburgh’s Rico Sennrich tweaked Google by saying: “The paper itself is not a scientific breakthrough, and builds heavily on recent research by other groups. Given the massive scale of the models, and the resulting computational cost, it is in fact surprising that they do not outperform recent published work—unfortunately, they only provide a comparison on an older test set, and against relatively old and weak baselines.”

Sennrich added, “As a side note, our WMT16 submission obtains 26.6 BLEU on newstest2014, as compared to Google’s 26.3; since we did not publish results on this old test set, this does not directly contradict their claim, but the numbers give some perspective to the Google system.”

In response Schuster says, “Our paper describes the new Google Neural Machine Translation system and obviously builds on many years, even decades of research by other groups. The comment seems to be about results on the WMT research database. We are not aware of published results that are better than our results on WMT. If there are any, please provide a link to the paper. We used newstest2014 because most other papers have been publishing results using this test set.”

Where Are We Heading?

tauyou’s Bartolome believes “future systems will combine all technologies depending on the use case.”

Already, says Daniel Marcu, “research in neural modeling has halved the error rates of large scale speech recognition and image understanding systems” within just a few years. He says Google’s latest announcement “fits this trend” in that “it provides irrefutable evidence that neural systems are likely to soon make traditional, statistical MT production systems obsolete.”

The next challenge is to commercialize NMT at a good economic price so it’s affordable—Tony O’Dowd

Vashee concludes that, in the near term, “Adaptive MT is more meaningful and impactful to the professional translation industry; but, as SYSTRAN has suggested, NMT adapts very quickly with very little effort to human correction. This is a very important requirement for MT use in the professional world. If NMT is as responsive to corrective feedback as SYSTRAN is telling us, I think we are going to see a much faster transition to NMT.”

KantanMT’s CEO O’Dowd is optimistic: “The next challenge is to commercialize NMT at a good economic price so it’s affordable—currently, a GPU-based machine is seven times the cost of a CPU-based machine. Companies that build a successful business model that delivers NMT at a good economical price will emerge as market leaders. We are witnessing a paradigm shift, which changes all the rules and is challenging our thinking, both operationally and commercially!”

Marion Marking contributed to this story.