Researchers Report Significant Progress in Real-time Machine Translation

Despite rapid advancements in machine learning technology, Google concedes that machine translation still commits mistakes humans would never make. Add to that the challenge of real-time input and the problem becomes that much trickier.

The use cases for real-time machine translation range from consumer applications, such as Skype Translator, to adaptive machine translation tools that promise significant productivity increases for professional linguists.

In a paper published October 3, 2016, researchers say they were able to show “for the first time” that certain algorithms could “perform simultaneous translation very well, much better than previous segmentation-based algorithms.”

“The final target of this research is speech,” Graham Neubig tells Slator. Neubig, an assistant professor at Carnegie Mellon’s Language Technologies Institute, worked on the study with University of Hong Kong PhD student Jiatao Gu and Chair Professor Victor O.K. Li, and New York University assistant professor Kyunghyun Cho.

Explains Neubig: “Simultaneous machine translation is technology to translate sentences in real-time as they are spoken or typed. Taking the example of speech, translating before the end of full sentences is important because it can take a speaker as many as 10–20 seconds for the speaker to complete a sentence, which means that it will take that long to even start delivering translated content to the user. This lag means that it is difficult to, for example, fluently participate in a multi-party conversation mediated by speech translation technology.”

Previously, according to Neubig, one way to reduce the lag was to chop the input into segments shorter than whole sentences and translate them independently of each other. If a good place to segment a sentence was found (“for example, between phrases that can be translated separately from each-other”), lag was reduced. The approach was faster, but translating segments in isolation also decreased fluency.
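As a rough illustration of the segment-based approach, the Python sketch below cuts the input after a fixed set of boundary words and translates each segment on its own. The names `segment`, `translate_segments`, and `toy_translate` are hypothetical, and cutting after a comma is only a stand-in for however a real system would choose its split points.

```python
from typing import Callable, Iterable, Iterator, List, Set


def segment(words: Iterable[str], boundaries: Set[str]) -> Iterator[List[str]]:
    """Yield a segment each time a boundary word is seen (assumed heuristic)."""
    current: List[str] = []
    for word in words:
        current.append(word)
        if word in boundaries:
            yield current
            current = []
    if current:  # flush whatever is left when the input ends
        yield current


def translate_segments(
    words: Iterable[str],
    translate: Callable[[List[str]], str],
    boundaries: Set[str],
) -> List[str]:
    """Translate each segment independently; no context is shared between them."""
    return [translate(seg) for seg in segment(words, boundaries)]


def toy_translate(seg: List[str]) -> str:
    """Toy stand-in for a translation system: it only tags the segment it was given."""
    return "<" + " ".join(seg) + ">"


stream = "when it rains , I stay home".split()
pieces = translate_segments(stream, toy_translate, {","})
print(pieces)  # ['<when it rains ,>', '<I stay home>']
```

Each piece can be delivered as soon as its segment closes, which is where the lower lag comes from. Because `toy_translate` never sees the neighbouring segments, the sketch also hints at why fluency can suffer.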

What makes this study different, however, is that it uses a neural machine translation (NMT) framework (Figure 2), which “automatically learns when to start translating words and when to wait for more input.”

Imagine, if you will, an NMT system that (1) waits for a translator to type a word. Then it (2) tries to generate a translation of the next word based on all the words typed so far. And then, based on the neural network’s current state (“and our confidence in the next translation,” says Neubig), it automatically decides whether to output the word or wait for additional input.

“If the answer is ‘yes, output the word,’ then output the word and return to 1. If the answer is ‘no, we’re not confident enough,’ then stop output and return to 2,” Neubig says.
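A minimal Python sketch of a wait-or-output loop in this spirit follows. It is one plausible reading of the steps above, not the researchers' implementation: the fixed `threshold`, the `predict` interface, and the toy predictor are all assumptions, whereas the system described in the article learns when to wait. In this sketch the loop keeps writing while it is confident and reads another input word when it is not.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

# Hypothetical interface: given the source words read so far and the target
# words already written, return (candidate_word, confidence in [0, 1]),
# or (None, 0.0) when there is nothing left to say.
Predictor = Callable[[List[str], List[str]], Tuple[Optional[str], float]]


def simultaneous_translate(
    source: Iterable[str],
    predict: Predictor,
    threshold: float = 0.5,
    max_flush: int = 100,
) -> Iterator[Tuple[str, int]]:
    """Yield (target_word, number_of_source_words_read_when_it_was_written)."""
    read: List[str] = []
    written: List[str] = []
    for word in source:  # wait for the next input word
        read.append(word)
        while True:  # propose the next target word from everything read so far
            candidate, confidence = predict(read, written)
            if candidate is None or confidence < threshold:
                break  # not confident enough: wait for more input
            written.append(candidate)
            yield candidate, len(read)  # confident: output the word
    # Input has ended, so write out the rest regardless of confidence.
    for _ in range(max_flush):
        candidate, _conf = predict(read, written)
        if candidate is None:
            break
        written.append(candidate)
        yield candidate, len(read)


# Toy predictor (assumption): each target word becomes "confident" only once
# a given number of source words has been read, e.g. the sentence-final verb.
TOY_PLAN = [("I", 1), ("ate", 5), ("the", 5), ("apple", 5)]


def toy_predict(read: List[str], written: List[str]) -> Tuple[Optional[str], float]:
    if len(written) >= len(TOY_PLAN):
        return None, 0.0
    word, needed = TOY_PLAN[len(written)]
    return word, (1.0 if len(read) >= needed else 0.2)


result = list(simultaneous_translate("ich habe den Apfel gegessen".split(), toy_predict))
print(result)  # [('I', 1), ('ate', 5), ('the', 5), ('apple', 5)]
```

In the toy run, "I" is written after one source word, while the verb and everything after it wait until the German verb arrives at the end of the sentence.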

He adds that, for the system to work properly, they had to ask themselves: How can we devise the appropriate machine learning algorithm for the task? How do we define the trade-off between translating expediently and translating accurately? How do we appropriately search for the best translation?

“The answers to these questions are the meat of the technical content in the paper,” Neubig says.
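The trade-off between translating quickly and translating accurately can be made concrete with a small Python sketch. The delay measure below follows the "average proportion" idea used in work on simultaneous NMT: for each target word, take the share of the source that had already been read when that word was written, then average. The linear `tradeoff_score` and its `delay_weight` are assumptions made for illustration, not the paper's actual objective.

```python
from typing import Sequence


def average_proportion(read_counts: Sequence[int], source_len: int) -> float:
    """Average share of the source already read when each target word was written.

    read_counts[i] is the number of source words read before target word i
    was output. The result is 1.0 when the system waits for the whole sentence.
    """
    if not read_counts or source_len <= 0:
        raise ValueError("need at least one target word and a non-empty source")
    return sum(read_counts) / (source_len * len(read_counts))


def tradeoff_score(quality: float, delay: float, delay_weight: float = 0.5) -> float:
    """Hypothetical combined score: quality is rewarded, delay is penalised."""
    return quality - delay_weight * delay


# A system that waits for all 5 source words before writing 4 target words.
wait_all = average_proportion([5, 5, 5, 5], source_len=5)
# A system that writes the first target word after a single source word.
eager = average_proportion([1, 5, 5, 5], source_len=5)
print(wait_all, eager)
```

A lower value means the system starts delivering output earlier; whether that is worth any loss in translation quality depends on how heavily delay is weighted.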

He points out, “In our experiments, we demonstrate for the first time that these algorithms are able to perform simultaneous translation very well, much better than previous segmentation-based algorithms. We think that the main reason for this is that our method remembers all of the previously input words and considers all of them when choosing the next word to translate, which was not easy with previous segmentation-based methods.”

What follows are key excerpts of Slator’s interview with Graham Neubig.

Slator: In Chapter 6, you say that simultaneous interpretation is a typical use case in related work, but your paper focuses on text input instead of voice input first. What is the main real-life use case driving this research?

Neubig: The final target of this research is speech. In this work we handled text [as] it is easier to do starting out; because there are additional things we need to consider when handling speech, such as the additional uncertainty of speech recognition results. We’re definitely interested in handling speech in the future, and this is on our plate of things to do.

Slator: Why did you choose to focus on this particular use case within NMT?

Neubig: First, because it is an important problem for speech translation. Second, because it is a problem where NMT was a very good fit. NMT works by predicting the next word in the sentence and outputting them one at a time—which is what we want in a simultaneous MT system. There are also a lot of interesting algorithmic considerations here as well.

Slator: Was the language combination German into English (Figure 1) chosen specifically because of the length of the distance between subject and verb?

Neubig: Yes, this was a major consideration in choosing this language pair. Previous work on simultaneous translation has focused on pairs with lots of reordering, such as German-English and Japanese-English, for this reason.

Slator: What happens if, in German into English, the model chooses a verb that turns out to be a clear mistranslation once the actual verb appears at the end of the sentence?

Neubig: This is a very interesting question that we haven’t considered yet. Actual human simultaneous interpreters will go back and correct themselves, but there is no mechanism for this currently.

Slator: What impact do you expect from the research and what follow-up work will you do?

Neubig: We hope that the eventual impact from this research will be speech translation, where you don’t have to wait long periods of time to get smooth, fluent output. Of course, this work is still just a step in this direction, and considerations like how to integrate the current method with speech recognition systems is something that will have to be tackled before we can make this happen.

Slator: You credit tech giants Facebook, Samsung, Google, Microsoft, and Nvidia at the end of the paper. Can you tell us why?

Neubig: These companies have given either Kyunghyun or Graham research gifts to pursue research either related specifically to simultaneous NMT, or NMT in general. While we obviously can’t speak for the companies, I think they are interested in providing funds to academia to promote research and education in areas they feel promising. They may or may not be interested in this particular project.

Slator: In particular, what is Nvidia’s interest in funding such research? Are GPUs deployed for neural networks, AI, and so on, already such a big driver of their business?

Neubig: I think they are certainly excited about machine learning using GPUs; but, of course, again, we can’t speak for them.

Florian Faes contributed to this article.

MT practitioners interested in these types of real-time applications are welcome to contact the researchers.