‘In Burmese It’s Just Not as Good’ – Facebook’s Language Tech Challenge

Facebook COO Sheryl Sandberg and CTO Mike Schroepfer took the stage at Code 2018 in May and faced tough questions from the inimitable Kara Swisher and Peter Kafka on everything from automation and growth at Facebook to cybersecurity concerns, content review, and regulation in the wake of the Cambridge Analytica and fake accounts scandals.

When audience members later had the chance to address Sandberg and Schroepfer at the media and tech conference in California, one attendee zoomed in on news that the Sri Lankan government had blocked access to Facebook in response to outbreaks of violence in the country stirred up by fake news posted on the platform. The attendee asked what Facebook is doing on the ground internationally to prevent similar situations from arising.

Schroepfer acknowledged that part of the challenge for Facebook was in making sure there are “people on the ground in the country who understand the landscape, the cultural landscape, the nuances of the languages, the NGOs to work with, the folks to work with there, to help understand where the issues are and where we need to intervene.”

As well as scaling up on the people side, Schroepfer said they are also exploring technological solutions.

“A lot of the AI tools we’ve built require large amounts of training data…and that training data is readily available in the bigger languages. But in languages like Burmese, it’s just not as good.” — Mike Schroepfer, CTO, Facebook

On how to improve provisions for low-resource languages, the CTO said Facebook is looking at “how to take…a classifier in one language like English and transmute it over to a language of very little data like Burmese so we can immediately deploy some of the technology we’ve built for other languages there.”

During his testimony at the Senate hearing in April, Facebook CEO Mark Zuckerberg had also faced specific questions on enhancing security and content controls for the low-resource languages Burmese and Rohingya, following violence in Myanmar.

Zuckerberg said that “over the long term, building A.I. tools is going to be the scalable way to identify and root out most of this harmful content. We’re investing a lot in doing that, as well as scaling up the number of people who are doing content review.” He specified that the team of people working on security and content review would be scaled up to over 20,000 before the end of 2018.

Given recent events, content review is a highly visible, high-stakes challenge for Facebook’s exec team. There have even been reports of a language service provider offering to team up with Facebook to combat the problem. Facebook is “doing all these paths in parallel because we want to solve this as quickly as we can,” Schroepfer said.

Accordingly, Facebook’s AI Research (FAIR) team has been busy with machine translation (MT) research in recent months, publishing a number of papers on sub-topics including post-editing and intrinsic and extrinsic uncertainty, with a particular focus on low-resource languages.

The research into post-editing has explored the premise that, for very simple interactions, Facebook users can click to access better translations. The work on intrinsic and extrinsic uncertainty investigates the notion that substantial gains in output quality can be achieved by a few tweaks to the training data. For low-resource languages, Facebook is using a combination of phrase-based statistical MT (PBSMT) and neural MT (NMT) to create new engines that use unsupervised learning, i.e. training with no parallel corpora.

Focus on Low-Resource Languages

The most recent paper published by FAIR on the arXiv repository, in June 2018, looks at scalability for NMT. The authors explored “how to train state-of-the-art NMT models on large scale parallel hardware” and found that “future hardware will enable training times for large NMT systems that are comparable to phrase-based systems.”

In May 2018, Facebook launched an open source AI framework, recognizing that low-resource languages and scalability are the two major hurdles facing the social networking giant as it endeavors to reliably safeguard the platform and execute some 6bn translations daily. The effort being poured into ongoing research is testament to the sheer size and importance of the language challenge facing the world’s largest social media company.


[Image: Mike Schroepfer]