3 Reasons Why Neural Machine Translation is a Breakthrough

Neural machine translation (NMT) reduces post-editing effort by 25%, outputs more fluent translations, and “linguistically speaking it also seems in quite a few categories that it actually outperforms statistical machine translation (SMT).” This comparison opened Samuel Läubli’s presentation during SlatorCon Zürich.

Läubli is a PhD Candidate at the University of Zürich and CTO of machine translation and opinion mining company TextShuttle. During his 20-minute presentation in front of 80 language industry peers at the inaugural SlatorCon Zürich, he discussed three reasons why NMT was a breakthrough. First, however, he wanted to briefly reinforce the foregone conclusion that NMT trumps SMT.

He reiterated that NMT had been widely adopted by large technology companies, especially over the course of 2017. Focusing on his own academic field, Läubli concluded that “in research at least, there’s no more question then, that NMT is better than SMT systems.”

Of course, that raises the question: why is it better? And what makes it that much of a breakthrough over its predecessor?

NMT Systems Understand Similarities between Words

Noting that his presentation would contain simplifications for a more general audience, Läubli laid out his first reason why NMT was a breakthrough: “NMT systems really have a notion of how similar words are.”

He explained that both SMT and NMT, in the simplest sense, work by numerical substitution: they replace words with numbers and then perform mathematical operations on those numbers to translate.

SMT assigns more or less random numbers, in the sense that two related words receive numbers that are not related to each other. Läubli gave an example of two sentences that differ in only one word, with that word used the same way in both: one sentence used “but” and the other used “except.”

SMT systems, he said, would assign the two words IDs such as 9 and 2, and therefore not relate them in any way. NMT systems, on the other hand, would assign values like 3.16 and 3.21, placing the words close together if the training data shows that they are used in similar ways.

“NMT systems capture the similarity of words and can then benefit from that,” Läubli said.
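The contrast between unrelated IDs and nearby values can be illustrated with a small Python sketch. All numbers below are hypothetical toy values chosen for illustration (the first coordinates reuse 3.16 and 3.21 from the talk); real NMT systems learn vectors with hundreds of dimensions from training data, and cosine similarity is just one common way to compare them.

```python
import math

# Hypothetical SMT-style IDs: the numbers say nothing about word meaning.
word_ids = {"but": 9, "except": 2, "banana": 5}

# Hypothetical 2-dimensional embeddings: words used in similar ways
# end up with similar vectors.
embeddings = {
    "but": [3.16, 0.90],
    "except": [3.21, 0.85],
    "banana": [-1.40, 2.70],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_close = cosine_similarity(embeddings["but"], embeddings["except"])
sim_far = cosine_similarity(embeddings["but"], embeddings["banana"])

print(f"but vs. except: {sim_close:.3f}")
print(f"but vs. banana: {sim_far:.3f}")
```

With IDs alone, nothing tells the system that “but” (9) is closer to “except” (2) than to “banana” (5). With vectors, that closeness can be measured directly.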

NMT Systems Consider Entire Sentences

The next reason NMT was a breakthrough, according to Läubli, is how NMT models assess the fluency of output.

SMT systems would evaluate the fluency of a sentence in the target language a few words at a time using an N-gram language model, Läubli said. “If we have an N-gram model of order 3, when it generates a translation, it will always assess the fluency by looking at n-1 previous words,” he said. “So in this case it would be 2 previous words.”

This means that, for a sentence of any length, an SMT system with a 3-gram language model only checks that each run of three consecutive words is fluent. “The context is always very local. As new words are added, we can always look back, but only to a very limited amount basically,” Läubli said, noting that if you take a look at SMT output, “you will see that subsequences are actually quite fluent, that’s not a problem. The sentence overall usually isn’t. It’s because of that [limitation].”
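The limited look-back of an N-gram model can be made concrete with a short Python sketch. The function name `ngram_contexts` and the `<s>` padding token are assumptions made for this illustration; the sketch only lists what context the model gets to see for each word and does not compute any probabilities.

```python
def ngram_contexts(tokens, n=3):
    """Pair each word with the n-1 words before it (the model's whole view)."""
    # Pad the start so the first words also have a full-length context.
    padded = ["<s>"] * (n - 1) + tokens
    return [
        (tuple(padded[i:i + n - 1]), padded[i + n - 1])
        for i in range(len(tokens))
    ]


sentence = ["the", "cat", "sat", "down"]
contexts = ngram_contexts(sentence, n=3)

for context, word in contexts:
    print(f"{word!r} is scored given only {context}")
```

However long the sentence is, each word is scored against exactly two predecessors when the order is 3, so anything earlier in the sentence has no influence on that score.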

Samuel Läubli, SlatorCon Zürich 2017

On the other hand, NMT models use recurrent neural networks. According to Läubli, “the good thing here is that we can condition the probability of words that are generated at each position on all the previous words in that output sentence.” In essence, where SMT can look back only as far as the order of its N-gram model allows, NMT evaluates fluency over the entire sentence.
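A heavily simplified Python sketch shows why a recurrent network can condition on all previous words. Everything here is a toy assumption: the state is a single number rather than a vector, and the word values and the weights `W_IN` and `W_REC` are picked by hand, whereas a real system learns them. The point is only that the state after the last word still depends on the first word.

```python
import math

# Hypothetical one-number "embeddings" and hand-picked weights.
EMBED = {"he": 0.5, "she": -0.5, "said": 0.3, "that": 0.1, "it": -0.2, "works": 0.2}
W_IN, W_REC = 0.8, 0.9


def rnn_states(tokens):
    """Run a scalar recurrent cell over the tokens and return every state."""
    h = 0.0
    states = []
    for tok in tokens:
        # The new state mixes the current word with the previous state,
        # so information from every earlier word is carried forward.
        h = math.tanh(W_IN * EMBED[tok] + W_REC * h)
        states.append(h)
    return states


sentence_a = ["he", "said", "that", "it", "works"]
sentence_b = ["she", "said", "that", "it", "works"]

final_a = rnn_states(sentence_a)[-1]
final_b = rnn_states(sentence_b)[-1]

# A 3-gram model sees the same two-word context before "works" in both cases.
trigram_context_a = tuple(sentence_a[-3:-1])
trigram_context_b = tuple(sentence_b[-3:-1])

print(f"final state A: {final_a:.4f}, final state B: {final_b:.4f}")
print(f"3-gram contexts identical: {trigram_context_a == trigram_context_b}")
```

The two sentences differ only in their first word. The 3-gram context for the last word is identical in both, while the recurrent state is not, which is the property Läubli describes.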

“In languages such as German where we have words that have long distance dependencies and so on, this is actually quite important if you want to generate a fluent output sentence,” Läubli concluded. “This is why the output of NMT systems tends to be a lot more fluent than the output of SMT systems if you look at entire sentences.”

NMT Systems Learn Complex Relationships between Languages

For his final reason why NMT is a breakthrough, Läubli compared how SMT and NMT systems are trained.

He explained that SMT systems have three separate main components:

  • the translation model that calculates the proper translation for words between languages
  • the reordering model that reorders words in the output, and
  • the language model, which is the N-gram model previously discussed

“The problem here is that these models were all learned independently from each other,” Läubli said. “There really wasn’t any interdependence between them. The problem was of course that for some languages, reordering plays more [of a role] than for others or the language model might be more important and so on.”

He went on to explain that researchers have found some workarounds for this problem, but the models are inherently limited. Some of what they wanted to adjust “just wasn’t possible.”
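To make the separation of components concrete, here is a minimal Python sketch of one common way to combine separately trained models: a weighted sum of their log scores. This setup is an assumption for illustration and was not spelled out in the talk, and the candidate names, probabilities, and weights are all hypothetical.

```python
import math

# Hypothetical probabilities from three independently trained components
# for two candidate translations of the same source sentence.
candidates = {
    "candidate A": {"translation": 0.40, "reordering": 0.70, "language": 0.10},
    "candidate B": {"translation": 0.30, "reordering": 0.60, "language": 0.35},
}

# Hypothetical weights, set after the components are already trained.
weights = {"translation": 1.0, "reordering": 0.5, "language": 1.0}


def combined_score(component_probs, weights):
    """Weighted sum of log probabilities across the separate components."""
    return sum(weights[name] * math.log(p) for name, p in component_probs.items())


scores = {name: combined_score(probs, weights) for name, probs in candidates.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print(f"selected: {best}")
```

In a setup like this, the weights can shift how much each component counts, but no component changes its own scores in response to the others.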

“In NMT systems, basically we’re looking at one single model, and of course that model has several components. But these components are all jointly trained,” Läubli said. “In this way you can actually capture a lot more interdependencies between the complex features that make up a language if you process them by machines.”

3 Problems with NMT

Not wanting to mislead the audience into thinking NMT is a “solved problem,” Läubli said research into the technology is ongoing. He also highlighted three problems of its own:

First, NMT can only translate on a sentence by sentence basis. “When we translate text what we do is cut the text into individual sentences and then all of these sentences are essentially translated in parallel. When the MT system does something to one sentence it doesn’t know about the others,” Läubli said. “When you get a translated text back out of these engines, it’s basically just a sequence of translated sentences. It’s not a translated text (entire document) in that sense.”

Läubli noted that this is an area of research they are working on at the University of Zürich.
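The sentence-by-sentence workflow Läubli describes can be sketched in a few lines of Python. The function `translate_sentence` is a hypothetical placeholder for a call to an MT engine, and the regex-based splitter is a deliberately naive assumption; real pipelines use more robust sentence segmentation.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def translate_sentence(sentence):
    """Hypothetical stand-in for an MT engine; it receives one sentence only."""
    return f"[translated] {sentence}"


def translate_document(text):
    """Translate each sentence independently and return them in order."""
    return [translate_sentence(s) for s in split_sentences(text)]


document = "Anna bought a car. It was red."
output = translate_document(document)

for line in output:
    print(line)
```

When the second sentence is translated, the engine has no access to the first, so it cannot know what “It” refers to. The output is a sequence of translated sentences rather than a translated text.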

Second, NMT needs a lot of training data to become fluent. “In fact we need a lot more translated texts to train a good NMT system than we needed for a good SMT system,” according to Läubli. He pointed out that this is not a problem for language pairs with lots of existing corpora, but that it would be an issue for low-resource languages.

He noted that this problem is also being tackled by other researchers, mentioning Google’s zero-shot translation.

Lastly, Läubli said he personally sees a problem that will emerge once NMT becomes widespread among users and translators: the human-computer interface.

He showed the audience a screenshot of a typical CAT tool that shows six segments of text for translation. “It’s kind of funny that we keep presenting texts in the form of tables to people all the time, in a sense, isn’t it?” he asked, noting that a lot of users have asked if it would be possible to translate entire texts and not segments, or at least translate segments in an order of their choosing. “We want to improve our MT systems by making them look at entire texts, but when people are translating entire texts all they see is 6 segments. This is a bit strange isn’t it?”

“It’s no wonder in my opinion, that 38% of professional translators use MS Word for post-editing,” Läubli said. “They trade in all the translation functionality a CAT tool would offer in favor of a tool that preserves document structure.”

“If you think that we’ll have better MT systems going forward, at some point, people will necessarily want to use some form of machine assistance even for translating literature or marketing texts,” he said. “But if we’re going to show it in interfaces found in today’s CAT tools, I’m not so sure if that’s gonna happen.”

Finally, in the panel discussion at the end of all presentations, Läubli fielded a few questions from the audience and Slator. He declined to forecast a timeline for when NMT would be fluent enough to handle stylistic problems like irony, but did answer a question from Slator’s co-founder Florian Faes, who asked how busy the NMT field is right now in terms of research.

“I guess people with a background on MT certainly find it easier these days to find a job… That’s for sure,” Läubli began, noting that MT is a very old problem. “People have been trying to translate automatically from one language into another for like 60 years and they’ve always said in the next five years we’re gonna be there and there we are 60 years later.”

“With NMT, because the techniques are similar to other machine learning applications, this has actually attracted quite a few people who worked on other problems similar to machine translation,” he said. “In that sense, it is a hot topic but it could be that these people go away again in two years and focus on other challenges.”

For a copy of Samuel Läubli’s presentation, register free of charge for a Slator membership and download a copy here.
