Following Amazon’s announcement of its own neural machine translation (NMT) offering, the company’s machine learning scientists published a paper on research repository Arxiv.org that details the inner workings of Sockeye, their open-source, sequence-to-sequence toolkit for NMT.
They also showed off benchmark performance comparisons between Sockeye and popular, open-source NMT toolkits using the Conference on Machine Translation 2017 datasets (WMT17) and results.
Amazon unexpectedly singled out language service providers (LSPs) as potential buyers during its late-November launch announcement. Sockeye, however, was introduced several months prior, in July 2017.
Open Sourcing Neural
The Sockeye announcement in July came on the heels of Google’s own July 2017 pitch for TensorFlow, the machine learning framework the search giant uses for Google Translate.
In a blog post, Google recognized MT as an active area of research where there was “a lack of material that teaches people both the knowledge and the skills to easily build high-quality translation systems.”
Amazon’s own open-source pitch for Sockeye is pretty much the same, and in its blog post AWS gave readers an idea of how Sockeye works and how to install it, train a model, and translate right away given a parallel corpus of one million pairs of translated sentences from WMT 2017.
And what about the actual paper submitted to Arxiv months after the fact showing off Sockeye’s performance compared to currently popular NMT systems?
“Such papers are necessary when releasing a toolkit that you want to gain wider adoption within the community,” said Dr. John Tinsley, MT expert and CEO of Iconic Translation Machines.
Amazon Flexing Its Muscles
The December 2017 paper presented Sockeye as a production-ready framework that can train and apply any of the three NMT architectures most used at present: recurrent neural nets with attention mechanisms (e.g. Harvard and Systran’s system used by Booking.com), self-attentional transformers (such as Google’s TensorFlow-based Tensor2Tensor), and convolutional neural networks (Facebook’s Fairseq).
In a sort of delayed entry to WMT 2017, the paper compares Sockeye’s performance against the WMT 2017 newstest evaluation set results in two language directions: English to German and Latvian to English. The researchers used BLEU (Bilingual Evaluation Understudy) to score performance.
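For readers unfamiliar with the metric, BLEU rewards a machine translation for sharing word sequences (n-grams) with a human reference translation and penalizes output that is too short. The following is a minimal sketch of that idea, not the scoring tool used in the paper: it is a simplified, unsmoothed, single-reference, sentence-level version, and the function names `ngrams` and `bleu` are illustrative. Published WMT figures are computed over a whole test set and are usually reported on a 0 to 100 scale.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Simplified single-reference, sentence-level BLEU in the range 0..1.

    Assumption: whitespace tokenization and no smoothing, so any n-gram
    order with zero matches makes the whole score 0.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: 1 if the hypothesis is longer than the reference,
    # otherwise exp(1 - reference_length / hypothesis_length).
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    print(bleu("the cat sat on a mat", "the cat sat on the mat"))
```

Because the score depends only on surface overlap with a reference, a fluent translation that uses different wording can score poorly.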
Tuning Sockeye to use recurrent neural nets with an attention mechanism, Amazon researchers claim the system beat both the Lua and Python implementations of OpenNMT, the Torch-based NMT toolkit developed jointly by Systran and Harvard University.
However, Sockeye could not displace top scorers Nematus and Marian, NMT systems both developed by the University of Edinburgh.
Next, the researchers put a self-attentional transformer model on top of the Sockeye framework. They reportedly managed to narrowly edge out two other transformer systems: Tensor2Tensor, a sequence-to-sequence model based on Google’s TensorFlow, and Marian, again from the University of Edinburgh.
Finally, the researchers also used convolutional neural nets on Sockeye, reportedly besting the framework used by the Facebook Artificial Intelligence Research (FAIR) team, Fairseq.
“What this paper shows is that Amazon have certainly developed a very competitive framework, which is not surprising given the breadth of MT talent they have in-house,” Tinsley said.
He noted, however, that at the pace NMT has developed over the past year, “the dust hasn’t yet settled on which tool might ultimately become the standard (a la Moses in SMT). We’re some time from that point yet.”
Scored with a Questionable Yardstick
Amazon’s researchers might claim Sockeye is competitive with current NMT systems, but it is worth noting that the basis for that claim, BLEU scores, offers only a limited view of actual translation fluency.
“Human assessments are the only reliable yardstick,” said Tinsley. “People continue to report BLEU scores merely because they have to, despite mounting evidence that they’re less effective with NMT.”
In a December 2016 article about BLEU as a metric, Tinsley suggested other existing scoring systems better suited for NMT, such as METEOR, TER (Translation Edit Rate), and GTM (General Text Matcher).
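To give a sense of how an edit-based metric differs from n-gram overlap, the sketch below counts the word insertions, deletions, and substitutions needed to turn a system output into the reference, divided by the reference length. This is a simplification: real TER also allows whole phrases to be shifted as a single edit, which this version leaves out, so it is closer to a plain word error rate. The function name `edit_rate` is illustrative and is not taken from any TER tool.

```python
def edit_rate(hypothesis, reference):
    """Word-level edit distance divided by reference length (lower is better).

    Assumption: whitespace tokenization and no phrase shifts, unlike full TER.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must not be empty")
    # prev[j] holds the edits needed to turn the hypothesis words seen so far
    # into the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(edit_rate("the cat sat on a mat", "the cat sat on the mat"))
```

A score of 0 means the output already matches the reference; higher values mean more post-editing would be needed to reach it.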
He acknowledged that while it is best that everything undergo human assessment, “that doesn’t work / scale in the research scenario.” He did remain hopeful that new, more applicable metrics will emerge out of research soon.
“At the end of the day, something more akin to confidence estimation might be ultimately more effective for NMT as opposed to comparison against reference translations,” Tinsley said.
“While initial focus might have been on their internal translation needs, it’s certainly interesting to see how they’re positioning the service,” Tinsley said of Amazon’s NMT offering. “It appears to be closer to Microsoft Translator rather than Google, in that they’re aiming at business use cases, but it’s still an out-of-the-box offering with a limited number of languages and no ability to customise.”
“It remains to be seen how far they go with it,” he concluded.