In their December 2022 paper, Synthetic Pre-Training Tasks for Neural Machine Translation, UCSD researchers Zexue He and Julian McAuley, and IBM’s Graeme Blackwood, Rameswar Panda, and Rogerio Feris, called synthetic data a potentially “promising stepping stone towards relieving the data burden in NMT as well as building accurate and trustworthy MT systems.”
Corpora crawled from the Web, a convenient source of vast amounts of data and a mainstay in machine learning, are linked to issues of copyright infringement and offensive output. (Some of the most egregious examples have been noted in the AI Incident Database.)
Using synthetic data can mitigate or eliminate these challenges, while improving the translation quality for low-resource languages.
To pre-train MT models, the team examined two methods, which they refer to as obfuscated data and synthetic data.
The first technique obfuscates the words in natural parallel data. Taking German-to-English parallel data as an example, each source (German) word corresponds to its own unique nonsense source token, and each target (English) word likewise corresponds to its own nonsense target token.
The researchers replaced each source and target word with its corresponding nonsense token to create nonsense parallel data that retained useful linguistic information, including distributional frequencies, word order, and grammatical structure.
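The mapping described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's implementation: the token format (`src_N` / `tgt_N`) and the dictionary-based mapping are assumptions made here for clarity.

```python
# Sketch of word obfuscation: each distinct word is replaced by its own
# unique nonsense token, hiding word identity while preserving word
# order, distributional frequencies, and sentence structure.
# (The "src_N"/"tgt_N" token format is illustrative, not the paper's.)

def build_obfuscator(prefix):
    mapping = {}  # word -> unique nonsense token, shared across calls

    def obfuscate(sentence):
        out = []
        for word in sentence.split():
            if word not in mapping:
                mapping[word] = f"{prefix}{len(mapping)}"
            out.append(mapping[word])
        return " ".join(out)

    return obfuscate

obfuscate_src = build_obfuscator("src_")
obfuscate_tgt = build_obfuscator("tgt_")

src, tgt = "das ist ein Test , das ist gut", "this is a test , this is good"
print(obfuscate_src(src))  # repeated words map to the same nonsense token
print(obfuscate_tgt(tgt))
```

Note that repeated words map to the same token, which is what preserves the distributional statistics the authors highlight.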
By contrast, the second method, procedural data generation, gave the researchers complete control over certain facets of the data, such as the level of noise.
The team used “permuted binary trees”: generating a random sentence, splitting it at a random point, and repeating the process on the resulting sub-strings. The synthetic parallel data produced this way reflects some aspects of the reordering that occurs naturally during translation.
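The split-and-permute procedure can be sketched as follows. This is a simplified illustration of the idea, assuming a made-up nonsense vocabulary, a swap probability, and a recursive stopping rule; the paper's exact generation procedure may differ.

```python
import random

# Illustrative sketch of the "permuted binary tree" idea: a random
# "source" sentence is split recursively at random points, and the two
# halves are sometimes swapped, yielding a "target" that mimics the
# reordering seen in real translation. The vocabulary, swap probability,
# and stopping rule here are assumptions, not the paper's settings.

def random_sentence(vocab_size=50, length=8, rng=random):
    return [f"w{rng.randrange(vocab_size)}" for _ in range(length)]

def permute_tree(tokens, swap_prob=0.5, rng=random):
    if len(tokens) <= 1:
        return tokens
    split = rng.randrange(1, len(tokens))        # random split point
    left = permute_tree(tokens[:split], swap_prob, rng)
    right = permute_tree(tokens[split:], swap_prob, rng)
    if rng.random() < swap_prob:                 # sometimes reorder the halves
        left, right = right, left
    return left + right

rng = random.Random(0)
src = random_sentence(rng=rng)
tgt = permute_tree(src, rng=rng)
print(src)
print(tgt)  # same tokens as src, possibly reordered
```

The source and target always contain the same tokens; only their order differs, which is the reordering signal the pre-training task is meant to teach.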
Synthetic Data In Action
According to the authors, their evaluation led them to the “surprising conclusion” that the transfer benefits of pre-training still apply even when pre-training on obfuscated or entirely synthetic data.
Combining obfuscated data with even a relatively small proportion of real-world data (e.g., German-to-English translations) was shown to provide “the majority of the benefit of large-scale regular pre-training.”
Can obfuscated data replicate the benefits of pre-training on regular parallel data? The researchers thought so, writing that “in the end, […] word identity may not be such an important component in a good pre-trained model, since even with an obfuscation ratio of 75% we still see much of the transfer benefit.”
To evaluate synthetic data, the team pre-trained models using two million sentence pairs of synthetic parallel data, and then fine-tuned each pre-trained model with real parallel data for a specific language pair: Myanmar (Burmese) to English; Indonesian to English; and Turkish to English.
While the permuted binary trees produced “substantial” transfer learning from synthetic pre-training to real-world tasks, the team did not see gains for Turkish to English, which the researchers attributed to the much greater volume of fine-tuning data available for that language pair.
“As fine-tuning data size increases, the necessity of transfer learning from pre-training diminishes,” they wrote.
For lower-resource languages, pre-training on synthetic data can improve translation quality once the model is fine-tuned for a specific language pair, suggesting that the model learns representations and structures relevant to translation during pre-training.
The authors concluded, “Our results show that transfer learning from synthetic pre-training has the potential to help to improve translation robustness for under-represented language pairs in multilingual models.”