Everything old is new again — including Google’s latest machine translation (MT) research. Co-authors Ankur Bapna, Orhan Firat, Yuan Cao, and Mia Xu Chen, who collaborated on a July 2019 paper presenting the culmination of five years’ work on a “massively multilingual” MT model, were joined this time around by Aditya Siddhant, Isaac Caswell, and Xavier Garcia.
Google’s January 2022 paper, Towards the Next 1,000 Languages in Multilingual Machine Translation, again takes up the cause of universal translation, addressing the challenge of scaling a massively multilingual model, which is usually done by training on more parallel data. In addition to the prohibitive cost involved in collecting and curating parallel data for so many language pairs, this solution is typically unhelpful for many low-resource languages with limited data.
“Beyond the highest resourced 100 languages, bilingual data is a scarce resource often limited only to narrow domain religious texts,” the authors wrote. To build and train an MT model that covers more than 200 languages, Google researchers employed a mix of supervised and self-supervised objectives, depending on the data available for each language.
This “pragmatic approach,” as described by the authors, can enable a multilingual model to learn to translate effectively, even for severely under-resourced language pairs with no parallel data and little monolingual data. Moreover, they wrote, the results of their experiments “demonstrate the feasibility of scaling up to hundreds of languages without the need for parallel data annotations.”
Conceptually, the researchers explained, “one could think of this as monolingual data and self-supervised objectives […] helping the model learn the language and the supervised translation in other language pairs teaching the model how to translate by transfer learning.”
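To make the idea concrete, here is a minimal Python sketch of how one training set might mix the two kinds of objectives. It is an illustration, not the authors' implementation: the function names, the `<2xx>` language tag, the `<mask>` token, and the simplified span-masking objective (which asks the model to restore the whole sentence) are all assumptions made for this example.

```python
import random

MASK = "<mask>"  # hypothetical mask token, chosen for this sketch


def supervised_example(src, tgt, tgt_lang):
    """Translation example: a tag tells the model which language to produce."""
    return {"input": f"<2{tgt_lang}> {src}", "target": tgt}


def self_supervised_example(sentence, lang, rng, span_frac=0.5):
    """Denoising example: hide a contiguous span, ask for the original sentence."""
    tokens = sentence.split()
    span_len = max(1, int(len(tokens) * span_frac))
    start = rng.randrange(len(tokens) - span_len + 1)
    masked = tokens[:start] + [MASK] * span_len + tokens[start + span_len:]
    return {"input": f"<2{lang}> {' '.join(masked)}", "target": sentence}


def build_examples(parallel, monolingual, seed=0):
    """Use translation pairs where they exist and masked sentences elsewhere."""
    rng = random.Random(seed)
    examples = []
    for (_src_lang, tgt_lang), pairs in parallel.items():
        for src, tgt in pairs:
            examples.append(supervised_example(src, tgt, tgt_lang))
    for lang, sentences in monolingual.items():
        for sentence in sentences:
            if sentence.strip():  # skip empty lines, which cannot be masked
                examples.append(self_supervised_example(sentence, lang, rng))
    return examples
```

A language with no parallel data contributes only masked examples, so the model can still learn that language while picking up translation from the supervised pairs.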
Pragmatic though it may be, the design is not new, ModelFront CEO and co-founder Adam Bittlingmayer told Slator, with “almost all competitive systems” now using some target-side monolingual data, even for major language pairs.
However, Bittlingmayer added, “it is in contrast to the recent publications from Facebook on this front.” For Facebook’s M2M-100, designed to avoid English as an intermediary between source and target languages, researchers manually created data for all pairs, while the social networking company snagged a November 2021 WMT win by focusing exclusively on translation to and from English.
Parallel or Monolingual Data?
The Google team performed two experiments. The first used parallel and monolingual data from the WMT corpus to train 15 different multilingual models, covering 15 languages to and from English. Each model omitted parallel data for one language, simulating a realistic scenario in which parallel data is not available for every language pair.
For each language, researchers then compared the performance of the “zero-resource model” (i.e., without parallel data) to a multilingual baseline trained on all language pairs using all parallel data available via the WMT corpus.
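The leave-one-out design can be sketched in a few lines of Python. This is a hypothetical illustration: the function name, the configuration layout, and the assumption that monolingual data is kept for every language (including the held-out one) are choices made for this example rather than details confirmed by the paper.

```python
def zero_resource_configs(languages, pivot="en"):
    """Build one training configuration per language, each dropping that
    language's parallel data while keeping its monolingual data."""
    configs = {}
    for held_out in languages:
        parallel_pairs = []
        for lang in languages:
            if lang == held_out:
                continue  # no parallel data for the zero-resource language
            parallel_pairs += [(lang, pivot), (pivot, lang)]
        configs[held_out] = {
            "parallel_pairs": parallel_pairs,
            "monolingual": list(languages),  # assumption: kept for all languages
        }
    return configs
```

Called with 15 language codes, the function returns 15 configurations, one per zero-resource model. The baseline would correspond to a single configuration with nothing held out.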
For high-resource languages, this setup was able to match the performance of fully supervised multilingual baselines, but it was not enough to help the lowest-resource languages in the study (e.g., Kazakh and Gujarati) achieve high-quality translation. Adding monolingual data for those languages had a significant positive impact, improving translation quality above that of a supervised model.
“Even for high-resource languages, the method can achieve similar translation quality by leaving out parallel data entirely (for the language under evaluation) and throwing in 3–4 times monolingual examples, which would be easier to obtain,” the researchers wrote.
The team found that adding zero-resource languages in the same model diminishes performance across languages, while adding more languages with parallel data helps in all cases, since an unsupervised language learns something from each supervised pair. In the same vein, a lack of parallel data seems to be slightly more detrimental to translation quality, compared to a lack of monolingual data.
Kenneth Heafield, Reader in MT at the University of Edinburgh, told Slator that these findings are not particularly surprising. “Using all the available data, parallel and monolingual, is usually best, provided it is clean,” he said, adding that there are, of course, exceptions, such as extreme cases of domain mismatch: “Trying to translate software manuals when your only parallel data is the King James Bible is difficult.”
While high-quality, the WMT dataset is relatively small and covers a limited number of languages. To scale the model to cover more than 200 languages, the researchers conducted a second experiment, starting with a highly multilingual crawl of the web for monolingual and parallel data.
They cleaned up the noisy monolingual dataset for the 100 lowest-resource languages for use in back-translation. The cleaner version of the monolingual data was then translated into English, generating synthetic data for the zero-resource language pairs.
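As a rough Python sketch of that back-translation step: monolingual sentences are filtered, translated into English by an existing model, and paired with their originals. The cleaning rule here (whitespace normalization, a minimum length, and de-duplication) and the `to_english` callable are stand-ins invented for this example; the paper's actual filtering and translation model are not reproduced.

```python
def clean(sentences, min_tokens=3):
    """Toy filter: normalize whitespace, drop very short lines and duplicates."""
    seen = set()
    kept = []
    for sentence in sentences:
        sentence = " ".join(sentence.split())
        if len(sentence.split()) < min_tokens or sentence in seen:
            continue
        seen.add(sentence)
        kept.append(sentence)
    return kept


def back_translate(monolingual, to_english):
    """Pair each cleaned sentence with its machine translation into English.

    `to_english` is a hypothetical callable wrapping an existing xx->en model.
    Each result is (synthetic English, original sentence).
    """
    return [(to_english(sentence), sentence) for sentence in clean(monolingual)]
```

Each resulting pair has machine-generated English on one side and human-written text in the low-resource language on the other, which is what makes it usable as synthetic parallel data.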
In this scenario, the authors wrote, “We find that xx→en and en→xx translation quality exhibit different trends.”
Translation quality into English did not correlate well with the amount of monolingual data available for the non-English language; rather, the languages that performed well were typically those with similar languages in the supervised set. (In this context, the languages are not necessarily similar from a linguistic perspective, but have similar representations and labels learned within a massively multilingual MT model.)
BLEU scores for translation from English into other languages were high only for languages with high into-English translation quality, as well as relatively large amounts of monolingual data.
While the paper did not provide a timeline for when Google Translate users might benefit from this research, there is certainly widespread demand.
“On the product side, at this point, our median fellow human — more than four billion of us — is an Internet user and does not understand English. And there is a content explosion,” Bittlingmayer said. “So there is just a strong pull from the market, even if spend lags views.”