Meta Claims ‘Breakthrough’ in Machine Translation for Low-Resource Languages


Just like his millions of friends on Facebook, Meta founder and CEO Mark Zuckerberg takes to the social network to announce important news. In a July 6, 2022 Facebook post, Zuckerberg explained why Meta AI’s recent No Language Left Behind (NLLB) project merits attention.

Specifically, Meta AI tweeted, the company built an AI model capable of translating between 200 languages — for a total of 40,000 different translation directions.

“To give a sense of the scale, the 200-language model has over 50 billion parameters,” Zuckerberg wrote. “The advances here will enable more than 25 billion translations every day across our apps.”

According to a July 6, 2022 LinkedIn post by Meta AI, the modeling techniques from this work have already been applied to improve translations on Facebook, Instagram, and Wikipedia.

A Meta AI blog post implies that the company aims to integrate translation tools developed as part of NLLB into the metaverse, noting that “the ability to build technologies that work well in hundreds or even thousands of languages will truly help to democratize access to new, immersive experiences in virtual worlds.”

While the paper does not include a list of languages addressed in the project, the NLLB page on GitHub mentions Asturian, Luganda, and Urdu as examples of low-resource languages. The authors, some of whom are affiliated with UC Berkeley and Johns Hopkins University in addition to Meta AI, noted that the degree of standardization varied across the languages studied, with a nominally “single” language sometimes contending with competing standards for script, spelling, and other conventions.

Researchers also weighed the potential risks and benefits of the new NLLB tools for low-resource language communities. They considered the impact on education especially promising, but wondered whether increasing the online visibility of certain groups might make them more vulnerable to increased censorship and surveillance, or exacerbate digital inequities within those groups.

In preparation for the project, researchers interviewed native speakers to better understand the need for low-resource language translation support. They then created a new dataset to level the playing field for low-resource languages: NLLB-Seed, composed of human-translated bitext for 43 languages.

The team used a novel bitext mining method to create hundreds of millions of aligned training sentences for low-resource languages. This process entailed harvesting monolingual data from the web and determining whether any two given sentences could be translations of each other.

Researchers then calculated the “distance” between sentences in a multilingual representation space using LASER3, which researcher Angela Fan singled out as a major contribution to improved translation of low-resource languages. Starting from the more general LASER model, researchers can specialize the representation space to cover a new language with very little data.
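In practice, mined candidate pairs are often ranked with a margin-based criterion over sentence embeddings. The snippet below is a minimal sketch of that idea, assuming LASER-style embeddings have already been computed; the function name, toy data, and neighborhood size are illustrative assumptions, not Meta’s actual pipeline. A pair scores high only if the two sentences are much closer to each other than to their other nearest neighbors, which filters out “hub” sentences that look superficially similar to everything.

```python
import numpy as np

def margin_scores(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio-margin scores for all source/target sentence pairs.

    src_emb, tgt_emb: L2-normalized sentence embeddings, shapes (n, d) and (m, d).
    Higher scores indicate likelier translation pairs.
    """
    sim = src_emb @ tgt_emb.T                            # cosine similarities, (n, m)
    # Average similarity to each sentence's k nearest neighbors on the other side.
    src_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # (n,)
    tgt_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # (m,)
    # Margin: divide by neighborhood density to penalize "hub" sentences.
    return sim / ((src_knn[:, None] + tgt_knn[None, :]) / 2.0)

# Toy usage: align each source sentence with its best-scoring target candidate.
rng = np.random.default_rng(0)
src = rng.normal(size=(5, 16)); src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = rng.normal(size=(7, 16)); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
best_match = margin_scores(src, tgt).argmax(axis=1)
```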

They also employed modeling techniques, including a Sparsely Gated Mixture-of-Experts architecture and curriculum learning, designed to significantly improve low-resource multilingual translation by reducing overfitting.
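For readers unfamiliar with conditional compute, here is a minimal sketch of top-k expert routing, the mechanism at the heart of a Mixture-of-Experts layer; every name and shape below is an illustrative assumption, not NLLB-200’s actual implementation. Each token activates only a few experts, so model capacity can grow without every parameter firing on every example.

```python
import numpy as np

def moe_layer(x, router_W, experts, k=2):
    """Route one token representation through its top-k experts.

    x: (d,) input vector; router_W: (d, n_experts) routing weights;
    experts: list of callables, each mapping (d,) -> (d,).
    Returns the gate-weighted sum of the selected experts' outputs.
    """
    logits = x @ router_W                          # one routing score per expert
    top = np.argsort(logits)[-k:]                  # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                           # softmax over selected experts only
    return sum(g * experts[i](x) for g, i in zip(gates, top))

# Toy usage: four random linear "experts" over a 16-dimensional representation.
rng = np.random.default_rng(0)
experts = [(lambda W: (lambda v: v @ W))(rng.normal(size=(16, 16))) for _ in range(4)]
out = moe_layer(rng.normal(size=16), rng.normal(size=(16, 4)), experts)
```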

NLLB introduced another innovation: FLORES-200, a high-quality, human-translated evaluation dataset. Fan explained that the previous state of the art had only been evaluated on 101 languages using FLORES-101, a many-to-many evaluation dataset from 2021.

The authors reported that their model achieved a 44% improvement in BLEU relative to the previous state of the art, thus “laying important groundwork towards realizing a universal translation system.”
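To make the percentage concrete, the sketch below computes a relative BLEU gain with the sacrebleu library; the toy sentences and the baseline/new split are invented for illustration, and the paper’s own evaluation used variants such as spBLEU over FLORES-200.

```python
from sacrebleu.metrics import BLEU  # pip install sacrebleu

# Hypothetical outputs from an old baseline and a new model on the same test set.
baseline_hyp = ["the cat sat on mat", "she go to school every day"]
new_hyp      = ["the cat sat on the mat", "she goes to school every day"]
# sacrebleu expects a list of reference streams, each parallel to the hypotheses.
refs = [["the cat sat on the mat", "she goes to school every day"]]

bleu = BLEU()
old_score = bleu.corpus_score(baseline_hyp, refs).score
new_score = bleu.corpus_score(new_hyp, refs).score

# A "44% improvement" reads as relative: (new - old) / old.
relative_gain = 100.0 * (new_score - old_score) / old_score
print(f"baseline {old_score:.1f} BLEU -> new {new_score:.1f} BLEU (+{relative_gain:.0f}%)")
```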

But, as is to be expected, improvement was not uniform across language pairs, with little to no improvement for pairs such as Armenian into English or French into Wolof.

Having open-sourced their work on GitHub, Meta AI now offers up to USD 200,000 in grants to help nonprofit organizations use NLLB-200.