Facebook Engineer on Why Translation Is Still Hard

You know neural machine translation (NMT) is reaching the peak of inflated expectations on Gartner’s Hype Cycle when engineers at giant tech companies proclaim on stage that the end game for the technology they are working on is nothing short of making the world a better place.

At the F8 Developer Conference, held April 19–20, 2017, Necip Fazil Ayan, Engineering Manager at Facebook, gave a 20-minute update on the current state of the art of machine translation at the social networking giant.

Slator reported in June 2016 on Facebook’s big expectations for NMT. At the time, Alan Packer, Engineering Director and head of the Language Technology team at Facebook, predicted that “statistical or phrase-based MT has kind of reached the end of its natural life” and that the way to go was NMT.

Halfway to Neural

Ten months on and Facebook says it is halfway there. The company claims that more than 50% of machine translations across the company’s three platforms — Facebook, Instagram, and Workplace — are powered by NMT today.

Facebook says it began exploring a migration from phrase-based MT to neural MT two years ago and deployed its first system built on the neural net architecture (German to English) in June 2016.

Ayan said 15 systems (covering high-traffic language pairs like English to Spanish, English to French, and Turkish to English) have been deployed since then.

No tech presentation would be complete without a healthy dose of very large numbers. Ayan said Facebook now supports translation in more than 45 languages (2,000 language combinations), generates two billion “translation impressions” per day, and serves translations to 500 million people daily and 1.3 billion monthly (that is, everyone, basically).
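The "2,000 combinations" figure is roughly what one gets by counting every ordered source-to-target pair among about 45 languages. The short check below is only a sanity calculation under that assumption (each translation direction counted separately, every pair supported). The talk as reported does not say how Facebook counts combinations, and the function name is made up for illustration.

```python
def directed_pairs(n_languages: int) -> int:
    """Number of ordered (source, target) pairs among n distinct languages."""
    return n_languages * (n_languages - 1)


# Assumption: "combinations" means translation directions, not unordered pairs.
pairs_45 = directed_pairs(45)  # 45 * 44 = 1980
pairs_46 = directed_pairs(46)  # 46 * 45 = 2070
print(pairs_45, pairs_46)
```

With 45 languages the count is 1,980 and with 46 it is 2,070, so "more than 45 languages" and "2,000 combinations" are at least consistent with each other.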

The Challenges

Ayan admitted that translation continues to be a very hard problem. He pointed to informal language as being one of the biggest obstacles, highlighting odd spellings, hashtags, urban slang, dialects, hybrid words, and emoticons as issues that can throw language identification and machine translation systems off balance.

Another key challenge for Facebook: low-resource languages. Ayan admitted Facebook has very limited resources for the majority of the languages it translates.

“For most of these languages, we don’t have enough data,” he said, referring to parallel data, or high-quality translation corpora. What is available, even for many low-resource languages, are large corpora of monolingual data.

So, Ayan explained, Facebook takes monolingual data (i.e., text), runs it through machine translation, and, voilà, an artificial corpus of bilingual data is created. Apparently, training an NMT system on a large machine-generated parallel corpus still works better than training on a small but high-quality one.
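The idea of manufacturing a parallel corpus from monolingual text can be sketched in a few lines. This is a minimal illustration, not Facebook's pipeline: the function names, the toy Spanish-to-English lexicon, and the choice to keep the human-written sentence as one side of each pair are all assumptions made for the example. A real setup would call an existing MT system where the toy lookup is used here.

```python
from typing import Callable, List, Tuple


def build_synthetic_corpus(
    monolingual: List[str],
    translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each monolingual sentence with its machine translation."""
    pairs = []
    for sentence in monolingual:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip blank lines in the raw text
        machine_output = translate(sentence)
        # (machine-generated side, human-written side)
        pairs.append((machine_output, sentence))
    return pairs


# Hypothetical stand-in for a real MT system: word-by-word lookup.
TOY_LEXICON = {"hola": "hello", "mundo": "world"}


def toy_translate(sentence: str) -> str:
    return " ".join(TOY_LEXICON.get(w, w) for w in sentence.lower().split())


corpus = build_synthetic_corpus(["hola mundo", "   "], toy_translate)
print(corpus)
```

The resulting list of sentence pairs can then be used as training data like any other parallel corpus, with the caveat that one side of every pair carries whatever errors the translating system makes.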

The third challenge he sees is doing translation at scale. “We have to train a lot of systems. We have to train them fast, and we need to decode and generate translations fast,” Ayan said.

NMT runs on power-hungry graphics processing units (GPUs), so improvements that make the computation faster and more efficient are very important for a company serving translations to hundreds of millions of users every day.

Ayan said one of the ways to make the system faster is “online vocabulary reduction.” “The bigger the vocabulary, the more expensive the computation is,” Ayan pointed out. He then went on to explain how Facebook “reduces the output projection layer size” (i.e., by ignoring certain words) to make the computation faster.
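To see why a smaller output vocabulary saves work, consider the final step of a translation model, which scores every word in the vocabulary against a hidden state. The sketch below scores only a shortlist of candidate words instead. It is a toy illustration: the article does not say how Facebook picks which words to ignore, so the lexicon-based shortlist, the always-kept word IDs, and all names and numbers here are assumptions.

```python
from typing import Dict, List, Set


def candidate_vocab(
    source_tokens: List[str],
    lexicon: Dict[str, List[int]],
    always_keep: Set[int],
) -> List[int]:
    """Collect output word IDs worth scoring for this particular input."""
    ids = set(always_keep)  # e.g. very frequent words (assumed)
    for token in source_tokens:
        ids.update(lexicon.get(token, []))  # likely translations (assumed)
    return sorted(ids)


def reduced_logits(
    hidden: List[float],
    weights: List[List[float]],
    candidates: List[int],
) -> Dict[int, float]:
    """Compute output scores only for the candidate rows of the weight matrix."""
    return {
        i: sum(w * h for w, h in zip(weights[i], hidden))
        for i in candidates
    }


# Toy output layer: 6 vocabulary entries, hidden size 2.
weights = [
    [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
    [2.0, 0.0], [0.0, 2.0], [-1.0, -1.0],
]
lexicon = {"hola": [2], "mundo": [4]}  # hypothetical source-word -> output IDs
cands = candidate_vocab(["hola", "mundo"], lexicon, always_keep={0})
logits = reduced_logits([0.5, 1.5], weights, cands)
best = max(logits, key=logits.get)
print(cands, logits, best)
```

Here only three of the six rows are multiplied out. In a production vocabulary with tens of thousands of entries, the same trick shrinks the most expensive matrix product in each decoding step, at the risk of excluding a word the model would otherwise have chosen.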

Ayan concluded by saying that Facebook has made a lot of improvements but still has a long way to go. As a slide behind him read, “This journey is 1% finished.”

Images: courtesy of Facebook