Working Around Machine Translation’s Need for Large-Scale Training Data

Machine Translation Data Requirements

Historically, more has always been more with regard to training data for machine translation (MT). The reliance on large-scale data, however, is not unique to MT, and now a number of fields in machine learning have seen a shift toward a new framework. So-called “foundation models” still require high volumes of training data, while complementary applications for specific tasks need less input.

In other words, many engineers now start from a foundation model and adapt it, through fine-tuning or prompting, with small, task-specific datasets.

Speaking at the November 2022 Intelligent Applications Summit, Stanford computer science professor Carlos Guestrin said that complex problems can now be solved with little data, adding, “Big data is not a priority anymore, in my opinion.”

Guestrin may have been referring to transfer learning, in which a model’s “experience” in one task improves its performance in another, separate task.

One instance of transfer learning appeared in Google’s 2016 paper on zero-shot translation: a multilingual MT model translated between two languages that it had seen in training only as parts of other language pairs.

Researchers found that zero-shot translation can be improved by continued training on a small amount of language-pair-specific data, much like the emerging trend of pairing data-heavy foundation models with fine-tuning on small, task-specific datasets.

The year 2022 saw the release of multiple papers exploring the usefulness of multilingual transfer, for example, in improving MT quality for South American Indigenous languages. Other papers asked whether domain mismatch degrades transfer learning across languages and whether domains can be transferred across languages.

But transfer learning, understandably an area of growing interest in MT, is more the exception than the rule. Where transfer learning occurs, the multilingual MT model has used the original training data for the first task (rather than additional data) to “teach itself” how to do a secondary task.

MT generally still requires large-scale, properly cleaned data for optimal output. This also applies to multilingual MT models designed to promote cross-lingual transfer learning: Although the model may translate between languages without parallel data connecting them, the model is still fed significant amounts of data for each language as part of other pairs.

To overcome that hurdle, researchers from a variety of institutions have worked on growing datasets through data augmentation (DA). Broadly speaking, the goal of DA is to generate new, high-quality training samples when available parallel data is scarce. 

Imperfect Solutions

A roundup of DA techniques highlights the methods researchers find most promising, as indicated by recent papers.

Back-translation, in which more widely available monolingual target-language data is translated into the source language, got a boost in March 2021, when University of Helsinki language technology professor Jörg Tiedemann released a dataset of over 500 million translated sentences covering 188 languages on GitHub.

Unfortunately, this technique “can yield semantically poor results” and does not necessarily solve the domain information gap (translation errors for low-frequency and out-of-vocabulary terminology). And according to a 2021 paper released jointly by researchers at Facebook, Amazon, Twitter, and the University of Melbourne, Australia, systems trained on back-translated data can be especially vulnerable to attacks that poison the monolingual data.
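To make the back-translation idea concrete, here is a minimal Python sketch. It is illustrative only: `back_translate`, `toy_reverse_model`, and the tiny German-English word list are hypothetical stand-ins, not taken from any of the papers or datasets mentioned. A real setup would replace the toy function with a trained target-to-source MT model.

```python
from typing import Callable, List, Tuple


def back_translate(
    target_sentences: List[str],
    target_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) training pairs.

    The target side is authentic monolingual text; the source side is
    produced by a reverse (target-to-source) translation function.
    """
    pairs = []
    for tgt in target_sentences:
        synthetic_src = target_to_source(tgt)  # machine-made source side
        pairs.append((synthetic_src, tgt))     # authentic target side kept as-is
    return pairs


# Hypothetical stand-in for a trained German-to-English model:
# a word-for-word lookup, purely for illustration.
TOY_DE_EN = {"das": "the", "haus": "house", "ist": "is", "klein": "small"}


def toy_reverse_model(sentence: str) -> str:
    return " ".join(TOY_DE_EN.get(w, w) for w in sentence.lower().split())


monolingual_german = ["Das Haus ist klein"]
synthetic_pairs = back_translate(monolingual_german, toy_reverse_model)
print(synthetic_pairs)
```

The synthetic pairs would then be mixed with whatever authentic parallel data exists to train the English-to-German direction. Because the source side is only as good as the reverse model, errors in that model, or tampered monolingual input, flow straight into the training set.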

Researchers from IBM and the University of California, San Diego described synthetic data as a “promising stepping stone towards relieving the data burden in NMT” in a December 2022 paper examining two techniques. To create obfuscated data, the authors replaced real source and target words with nonsense tokens. To create fully synthetic data, they generated random sentences, which they repeatedly split at random points.

In addition to improving translation quality for low-resource languages, the researchers wrote, synthetic data could mitigate or eliminate certain challenges posed by training models on Web-crawled text (e.g., data copyright infringement and the generation of offensive output). 
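The obfuscation idea can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the paper’s actual procedure: it assumes each distinct word is mapped consistently to one random nonsense token, with separate mappings for the source and target sides. The names `obfuscate` and `make_nonsense_token` are hypothetical, and the random-sentence technique is not covered.

```python
import random
import string
from typing import Dict, List, Tuple


def make_nonsense_token(rng: random.Random, length: int = 6) -> str:
    """Return a random lowercase string to stand in for a real word."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def obfuscate(
    sentences: List[str], rng: random.Random
) -> Tuple[List[str], Dict[str, str]]:
    """Replace every word with a nonsense token.

    Assumption: the mapping is one-to-one and consistent, so the same
    word always becomes the same token and sentence structure survives.
    """
    mapping: Dict[str, str] = {}
    used = set()
    obfuscated = []
    for sentence in sentences:
        tokens = []
        for word in sentence.split():
            if word not in mapping:
                token = make_nonsense_token(rng)
                while token in used:  # avoid two words sharing a token
                    token = make_nonsense_token(rng)
                used.add(token)
                mapping[word] = token
            tokens.append(mapping[word])
        obfuscated.append(" ".join(tokens))
    return obfuscated, mapping


rng = random.Random(0)  # fixed seed so runs are repeatable
src_obf, src_map = obfuscate(["the house is small", "the garden is small"], rng)
tgt_obf, tgt_map = obfuscate(["das Haus ist klein", "der Garten ist klein"], rng)
for src, tgt in zip(src_obf, tgt_obf):
    print(src, "|||", tgt)
```

Under these assumptions, the obfuscated pairs keep the alignment and word-order patterns of the original parallel text while carrying none of its real vocabulary.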

In a twist on the traditional big data challenge, multimodal MT mines non-text information to improve translation quality. This technique relies on bilingual texts paired with images, data that is itself scarce, so researchers have experimented with alternatives, namely monolingual image-text data and parallel text-only data.