November 18, 2016
Neural Conquers Patent Translation in Major WIPO Roll-out
Barely two years after it was first proposed, machine translation technology based on neural networks is going mainstream. After Google, Systran, and Microsoft, the World Intellectual Property Organization (WIPO) announced on October 31, 2016 the roll-out of neural machine translation (NMT) on its publicly available translation tool called WIPO Translate.
WIPO, a self-funding UN agency made up of 189 member states, is based in the Swiss city of Geneva.
Work on WIPO Translate began with the open-source, statistical machine translation framework Moses back in 2009. Two years later, the WIPO team had a Moses-based engine ready.
Initially, WIPO Translate was trained to translate between Chinese, Japanese, and Korean patent documents and English, as those languages accounted for about 55% of worldwide patent filings in 2014.
Like Google, WIPO has chosen Chinese-English as the trailblazing language combination for its NMT roll-out. That is because in 2015, 14% of all international patent applications were filed in Chinese, according to Francis Gurry, WIPO Director General. “This year, we expect that to go to something like 17% or 18%,” he said on the WIPO YouTube channel.
To roll out the beta version, WIPO trained the tool on a giant corpus of 60 million sentences found in Chinese patent documents from China’s State Intellectual Property Office, which were filed at the US Patent and Trademark Office. Next, WIPO plans to extend the tool’s coverage to patent applications in French, followed by other languages.
To find out more about the NMT production deployment, Slator spoke to Bruno Pouliquen, WIPO Senior Engineer, and Christophe Mazenc, Director of Global Databases Service.
The rapid roll-out of NMT was a matter of months, recalled Mazenc. “It took probably one or two months to train the model the first time, and then one month to integrate into our systems,” he said.
WIPO partly credits Marcin Junczys-Dowmunt and his technology AmuNMT for the fast deployment. Junczys-Dowmunt is a visiting professor at the University of Edinburgh and a WIPO contractor.
Junczys-Dowmunt’s AmuNMT, Pouliquen explained, is “a tool that can translate very fast using NMT models,” even on a CPU. Within a year, the WIPO engine became more efficient and reliable and produced better-quality output. The team had assembled patent data from Chinese and US patent applications and trained the engine with the open-source tool Nematus, which was developed by Rico Sennrich of the University of Edinburgh, birthplace of Moses SMT.
Pouliquen is confident WIPO Translate beats Google on the narrow patent domain: “I think one key aspect of our tool is we train only on patent text. So our tool is very focused and, therefore, it is better because it is not polluted with other things.”
The narrow focus comes with restrictions. Pouliquen pointed out, “If you try to put an e-mail into our tool, you will see that the result is just disastrous; because it doesn’t know how to translate e-mail, it never learned how.”
He cited what he called an “amazing” example: The tool is unable to translate “I am.” Pouliquen said, “‘I am’ is never seen in any patent application, so the tool doesn’t know how to translate it.”
What drove WIPO to look into NMT soon after the idea of using neural networks for translation was first floated, Pouliquen said, was an awareness of the limits of phrase-based machine translation, limits they quickly found NMT could overcome. He said that, when they compared BLEU scores, there was a “big jump for Chinese into English,” indicating a vast improvement from SMT to NMT.
“The difference in BLEU is very impressive,” agreed Databases Director Mazenc, to which Pouliquen added, “Some translators even told us that it was definitely better in terms of human translator evaluation.”
Additionally, Pouliquen said, WIPO was sitting on a huge stockpile of parallel data they could train the MT engine on, so “we are effectively in a good position to be one of the quickest to use the technology.”
Does Size Matter?
Pouliquen says WIPO has more reference data for their domain than even tech giants Microsoft or Google. But does the size of the corpus actually matter in NMT?
Pouliquen said, “Our corpus is so big that even with two weeks of machine-intensive training, we didn’t manage to put all the corpus inside. So it’s a bit early to give you an exact answer on that. The only thing we could say is, it doesn’t harm, definitely doesn’t harm.”
He pointed out that, other than size, the quality and timeliness of the corpus are also important. “It’s quite obvious that a quality corpus is better than a bigger [one]. And it’s also quite obvious that recent data is more important than more data.”
The WIPO engineer explained that, in the patent domain, a tool trained on the latest inventions will get a better model to decode new inventions. “Recent terminology is more important than old terminology,” he said.
Are SMT’s Days Numbered?
Asked whether he thinks SMT will still be relevant in three years, Pouliquen said, “It’s like looking in a crystal ball before doing any experiment. I guess all our own models could be replaced by NMT. But if we see that, for example, Portuguese works better with SMT, we will keep SMT for Portuguese. But I think when we’ve got enough data, NMT might be better.”
WIPO is sharing its machine translation technology with other UN organizations. At UN Headquarters in New York, WIPO’s technology has been integrated into the proprietary translation productivity tool and provides MT suggestions for post-editing when there is no match from the translation memory.
This requires WIPO to train their models on a very different set of data, of course. But translation corpora are not something the United Nations lacks. In May 2016, the UN released its Parallel Corpus, consisting of nearly 800,000 documents or slightly over 1.7 million aligned document pairs. Pouliquen promises that UN translators will get to enjoy the benefits of NMT sometime in 2017.
Image: WIPO HQ in Geneva