The 4,000 Lines of Code Harvard Hopes Will Change Translation

OpenNMT should become to neural machine translation (NMT) what Moses is to phrase-based statistical machine translation. So hope the creators of OpenNMT at Harvard and Systran as they launch an NMT toolkit for what they describe as a “widely-applied technique for machine translation.”

The release of the paper outlining the toolkit follows Harvard NLP’s December announcement of the OpenNMT system.

In their paper, Systran Research Engineer Guillaume Klein, Harvard NLP’s Yoon Kim and Yuntian Deng, Systran Chief Scientist Jean Senellart, and Harvard NLP adviser Alexander Rush outline an open-source NMT toolkit that aims to “support NMT research” through “modeling and translation support, as well as detailed pedagogical documentation about the underlying techniques” of the Harvard OpenNMT system.

The keyword is “open-source.” As the paper points out, although there are now several existing NMT models, they are either “closed source…unlikely to be released with unrestricted licenses,” such as those by Google, Microsoft, and Baidu, or “exist mostly as research code” (GroundHog, Blocks, tensorflow-seq2seq, lamtram, and Harvard’s own seq2seq-attn).

While the authors acknowledge that these systems serve an important purpose, they add that such systems provide little support for use in an actual production environment.

They call the University of Edinburgh’s Nematus system “most promising,” touting its high-accuracy translation, clear documentation, and use in several successful research projects — and then promptly compare Harvard OpenNMT to Nematus via the inevitable BLEU score yardstick. (Guess which won? See table excerpted from paper.)

The paper notes that “one nice aspect of NMT as a model is its relative compactness”; that is, relative to Moses. The entire OpenNMT system (with pre-processing, the authors note) has around 4,000 lines of code. The Moses SMT framework comes in at over 100,000 lines of code, according to the paper’s authors.

At press time, OpenNMT had garnered 606 stars on GitHub, and its creators say there has been “active development by those outside” Harvard and Systran. So it seems that some of the hobbyists Rush mentioned in Slator’s previous coverage are indeed starting to tinker with the system.

The launch of the toolkit comes at a time of intense mainstream interest in the accelerating progress of language technology in general, and neural machine translation in particular. Major news outfits, such as The New York Times and The Economist, recently reported on these latest developments.