The Race for Open Source Neural Machine Translation

Neural machine translation (NMT) often figures prominently at SlatorCon events, and SlatorCon London, held at Nobu Hotel in London Shoreditch on May 17, 2018, was no exception. In his presentation for the event, Jean Senellart, Global CTO of event partner Systran, discussed an aspect of NMT that he found both exciting and scary at the same time: the race for open source.

Senellart briefly went through the history of 50-year-old machine translation company Systran, a company that experienced and was directly involved in production-level deployments of all MT technologies, from rules-based MT to statistical MT to NMT.

He also spoke about the success of OpenNMT, the open source NMT framework Systran and Harvard University built hand in hand, giving the audience an update on French company Ubiqus joining their venture.

Since its launch in early 2017, OpenNMT has developed into the second largest open source NMT project, with 18 major releases, 3,300 stars and 1,020 forks on GitHub, and 6 complete code refactorings.

And this is where Senellart touched upon the core of his presentation: “We are talking about five thousand lines of code. We are talking about something huge and something tiny at the same time.”

NMT Changed MT History

When Senellart said he was talking about something huge, he was generally referring to how NMT has radically changed MT history.

In his presentation, Senellart showed that rules-based MT went into production in 1968 and stayed dominant until 2007, when statistical MT became good enough for production. Then, in 2016, NMT, essentially a two-year-old technology, took over very quickly.

“SMT was created in the 90s by IBM. It took 15 years to come to industry-level production,” Senellart said. “NMT was introduced by the academia in 2014. It took two years to be adopted by the industry.”

Aside from the massive difference in pace of development and industry adoption, Senellart also noted how each technology differed in what was considered its main asset. In rules-based MT, the asset was the code and the linguistic resources accumulated. For statistical MT, the asset was the data.

“The more data you have the better criteria you had and the equation was very simple,” Senellart said. “Double the data, and you were getting one more BLEU [Bilingual Evaluation Understudy] point.” He also noted that the first attempts at systematizing MT evaluation began during the reign of statistical MT.
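
Taken literally, the "double the data, one more BLEU point" rule of thumb means the expected gain grows with the base-2 logarithm of the increase in corpus size. The following is only a back-of-the-envelope sketch of that arithmetic, not a model Senellart presented; the function names and the corpus sizes are hypothetical.

```python
import math


def expected_bleu_gain(old_sentences: int, new_sentences: int) -> float:
    """BLEU points gained under the 'one point per doubling' rule of thumb.

    Assumption: the rule holds exactly, so the gain is log2 of the size ratio.
    """
    if old_sentences <= 0 or new_sentences <= 0:
        raise ValueError("corpus sizes must be positive")
    return math.log2(new_sentences / old_sentences)


def data_needed_for_gain(old_sentences: int, bleu_points: float) -> int:
    """Corpus size needed to gain `bleu_points` under the same rule."""
    if old_sentences <= 0:
        raise ValueError("corpus size must be positive")
    return round(old_sentences * 2 ** bleu_points)


if __name__ == "__main__":
    # Hypothetical starting corpus of one million sentence pairs.
    print(expected_bleu_gain(1_000_000, 2_000_000))
    print(data_needed_for_gain(1_000_000, 3))
```

Under this reading, each additional point costs as much data as all previous points combined, which is why the statistical MT era rewarded whoever held the largest corpus.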

Finally, NMT burst onto the scene, and with it another asset shift: “We are not talking about big data anymore; we are talking about good data,” said Senellart.

The Good and Bad of Open Source

Aside from the change of mindset regarding data assets, Senellart emphasized that the open source aspect of NMT was also significant. “If you look at the last two years there has been, every month, about two new open source projects for NMT, so it’s incredible,” he said.

While that seems encouraging, Senellart noted that a lot of them are “dying,” i.e. new projects are not being maintained. Even Google would launch a new open source project only to abandon its maintenance in favor of a new technology or development, reflecting how fast NMT technologies evolve.

Senellart also called attention to the fact that while most open source projects come from academia, the ones with the most activity are from industry players. Google, for instance, handles the biggest open source project with the most activity, and second to that is Systran’s own OpenNMT. Third on the list is Facebook.

This is “odd,” Senellart noted, because prior to this, Big Tech players like Google, Amazon, and Salesforce did not have an active open source culture. He went on to say that developments in technology were usually followed by published papers, often found on the research repository arXiv.org.

“There are very few players that are not open; that are not open sourcing their projects,” Senellart said, naming DeepL, Omniscient, and Microsoft as some of them. They do release their “numbers,” however: like report cards, they publish how well their NMT engines perform on metrics such as BLEU.

So this is part of the good side of open source: collaboration even among competitors.

According to Senellart’s numbers, in 2017 there were 250 publications regarding NMT. “No company in the world can reproduce 250 papers just to check if they’re right or wrong and it is one of the reasons of the necessity of open source today,” he said.

In fact, Senellart noted that NMT tech has evolved so fast that in 14 months, there have been three major paradigm shifts in terms of the technology used. First, researchers used recurrent neural nets (RNNs), then they flocked to Facebook-led convolutional neural networks (CNNs), and finally to Google’s self-attentional transformer models.

Senellart drew an interesting parallel between how the technology evolved and how humans process language and translation. RNNs process translation sequentially, word by word. CNNs process more generally, looking at sequences of words. Finally, the attention-based approach literally pays more attention to certain parts of the text that may have a significant impact on understanding and translating it.
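
To make the "pays more attention" idea concrete, here is a minimal sketch of scaled dot-product attention, the weighting step at the heart of transformer models, written in plain Python. It is an illustration only, not code from OpenNMT or any other toolkit; the function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weight each source position by its similarity to the query.

    Returns the attention weights and the weighted average of `values`.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for three source words.
    keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
    weights, context = attend([1.0, 0.0], keys, values)
    print(weights)  # the first source word receives the largest weight
    print(context)
```

Unlike an RNN, nothing here runs word by word: every source position is scored against the query at once, and the positions most similar to it dominate the result.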

Then of course, with the good came the bad, and where the open source race helped speed up development, it also meant active players had to “fight for survival,” according to Senellart.

“An open source project is very fragile,” he said, explaining that Systran had to support OpenNMT’s users and community, share data and even failed experiments, fix issues, and make everything stable and compatible, among other things.

“I remember one year ago, I received a call from Booking.com who used OpenNMT,” Senellart told the audience. “They were just asking me will OpenNMT be there in one year because we are launching production now and can you guarantee that you’d still be there in one year?”

What’s the Finish Line for this Open Source Race?

Reflecting on the paradigm shifts that the open source race accelerated, Senellart said, “I’m talking about five thousand lines of code. It’s not as if we have made something huge. It is [a] small discovery and totally incompatible, which is the hint that we are still at the very beginning.”

“The big question I have is why are we all fighting for this?” Senellart asked about the open source race.

“I think it’s not NMT. It’s bigger than that. The real battle is behind the AI framework that you are using,” Senellart said. These frameworks include Microsoft’s CNTK, Google’s TensorFlow, Facebook’s PyTorch, and Amazon’s Sockeye. Senellart argued that NMT is only the proxy battlefield where players are “fighting to have their framework become the first framework… Because I believe that NMT is the gateway for all the NLP technologies.”

Senellart said NMT is quickly becoming a commodity: “It’s like running water or electricity; it’s everywhere. We need NMT and one question is what will be the winning computing framework in that case?”

Senellart said the industry is going in a specific direction, pointing out the Open Neural Network Exchange or ONNX, a joint, industry-wide effort between Facebook, Amazon, and Microsoft. ONNX is basically a standardization project, according to Senellart, who noted that standardization arises when technologies reach sufficient maturity.

“ONNX will allow you to take systems trained with TensorFlow to run with Caffe, which is a Facebook platform that lets you run a neural network on your mobile. Or you can do that with Sockeye from Amazon and run that on whatever,” Senellart explained.

He said ONNX is probably key to developing industry-wide standardization, but past that is a more important realization about NMT: “we need to realize that we are still at the very beginning of the technology.”

Senellart briefly enumerated a number of impressive recent developments, including announcements regarding smarter virtual assistants and unsupervised machine learning for low-resource language translation, and touched upon adding document-level context to NMT so that it goes beyond sentence-by-sentence translation. He said Systran’s clients are now also looking for domain specialization, beyond general-purpose translation.

As for the future of NMT, Senellart said NMT might be able to eventually augment human capabilities not only through increasing productivity and speed, but also through augmenting the way humans learn.

During the panel discussion, Senellart fielded a few questions from the audience, among which was a question on technological maturity: will the industry eventually be using the same level of NMT?

Senellart said yes, eventually the technology will plateau at about the same level across all players, but by then the differentiator will be the training data. As for how language data will factor into systems like zero-shot translation, where essentially no bilingual data is required for machine learning, he said training data will still be used to build and model systems. “Unsupervised machine learning [such as what is used in zero-shot translation] will probably help to make new language pairs, [translate] small, low-resource languages and probably increase the quality of the existing big ones,” Senellart said, “but data will still be there.”

Gino Diño

Content strategy expert and Online Editor for Slator; father, husband, gamer, writer―not necessarily in that order.