Can ‘Huge Amounts’ of Synthetic In-Domain Data Improve Machine Translation?


With the many noteworthy advances in machine translation (MT) and natural language processing (NLP), it is no wonder that large and small-scale users alike now expect each new MT iteration to measurably outperform its predecessor.

From a functional perspective, MT does keep getting better, thanks in no small part to ongoing research and the large datasets freely available for training equally large MT engines. However, domain-specific MT remains very much a work in progress.

Researchers Yasmin Moslem, Rejwanul Haque, John D. Kelleher, and Andy Way, of the ADAPT Centre, Dublin City University, National College of Ireland, and Technological University Dublin, set out to tackle this domain-specific problem with an experiment using three different setups.

In a paper published in August 2022, this group of NLP specialists defined the problem as “in-domain data scarcity […] common in translation settings due to the lack of specialized datasets and terminology, or inconsistency and inaccuracy of available in-domain translations.”

The researchers also cited lack of adequate computational resources and in-house specialized translation memories as part of the problem. Furthermore, they deem the process of mining open datasets “inefficient.”

MT Fine Tuning and HITL Models

Many researchers have previously worked on the domain-specific MT problem, as the extensive bibliography in the paper shows. Approaches attempted include selecting subsets of large monolingual datasets, automatically forward-translating them, and fine-tuning on the result.

Other approaches have used fuzzy matches from bilingual datasets followed by further editing and fine-tuning, or general MT engine training followed by domain-specific fine-tuning.

The researchers present an approach to domain adaptation that still uses pretrained language models, but adds domain-specific data augmentation through back-translation.

The methodology also takes into account certain linguistic characteristics, such as fluency. It uses mixed fine-tuning (i.e., additional MT training on a mixture of general and in-domain data) and oversampling (i.e., repeating the smaller in-domain dataset so it is not drowned out by the much larger general corpus) to generate what the researchers call “huge amounts of synthetic bilingual in-domain data.”
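The oversampling step can be sketched in plain Python. This is a minimal illustration, not the authors' code; the corpora and the function name `mixed_fine_tuning_data` are made up for the example:

```python
import random

def mixed_fine_tuning_data(in_domain, general, seed=0):
    """Oversample the small in-domain corpus so it roughly matches
    the size of the larger general corpus, then shuffle the mixture
    to produce training data for mixed fine-tuning."""
    repeats = max(1, len(general) // len(in_domain))
    mixed = general + in_domain * repeats
    random.Random(seed).shuffle(mixed)
    return mixed

# Toy corpora of (source, target) sentence pairs
general = [(f"src {i}", f"tgt {i}") for i in range(100)]
in_domain = [("covid src", "covid tgt")] * 5

mixed = mixed_fine_tuning_data(in_domain, general)
print(len(mixed))  # 100 general + 100 oversampled in-domain pairs = 200
```

Without the oversampling, the five in-domain pairs would make up under 5% of each training epoch; repeating them balances the mixture.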

Their approach, the paper reports, produced “significant improvements of the translation quality of the in-domain test set.” The methodology proposed follows these main steps:

  • Text generation with a large language model in the target language to augment in-domain data;
  • Back-translation to obtain parallel source sentences;
  • Mixed fine-tuning; and
  • Oversampling.
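Taken together, the first two steps can be sketched as the following pipeline. This is a hedged illustration: `generate_in_domain_text` and `back_translate` are hypothetical stand-ins for a target-language LLM and a target-to-source MT model, not APIs from the paper; steps three and four then apply mixed fine-tuning and oversampling to the resulting parallel data.

```python
def generate_in_domain_text(seed_sentences, n):
    """Stand-in for a large language model prompted with in-domain
    seeds to generate new target-language sentences (step 1)."""
    return [f"synthetic target sentence {i}" for i in range(n)]

def back_translate(target_sentences):
    """Stand-in for a target-to-source MT model that produces the
    parallel source side of each synthetic sentence (step 2)."""
    return [f"back-translated source of: {t}" for t in target_sentences]

def build_synthetic_corpus(seed_sentences, n):
    """Chain generation and back-translation into (source, target)
    pairs of synthetic bilingual in-domain data."""
    targets = generate_in_domain_text(seed_sentences, n)
    sources = back_translate(targets)
    return list(zip(sources, targets))

synthetic = build_synthetic_corpus(["in-domain seed sentence"], 3)
print(len(synthetic))  # 3 parallel pairs
```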

The researchers used a baseline setup and two domain-specific setups: one based on a small bilingual dataset, the other on a source-only (monolingual) dataset, with forward-translation used to generate the target side.

The experiment was conducted using Arabic-to-English and English-to-Arabic language pairs, and the domain chosen was public health. The datasets used were from the Open Parallel Corpus (OPUS) and a domain-specific dataset, the Covid-19 Translation Initiative dataset (or TICO-19).

The outcome was evaluated automatically using spBLEU, a BLEU variant computed on SentencePiece-tokenized text, over a benchmark of 3,001 multi-domain sentences professionally translated into 101 languages.

The results were also evaluated linguistically by Dr. Muhammed Yaman Muhaisen, a native Arabic speaker, subject-matter expert, and ophthalmologist at the Eye Surgical Hospital in Damascus, Syria.

Dr. Muhaisen conducted the bilingual linguistic evaluation on a sample of 50 sentences randomly selected from the original test set. He was asked to use a scale in which quality ranged from 1 (unacceptable translation) to 4 (ideal translation).

The Results

The fine-tuned, domain-specific models generated “more idiomatic translations or better capture the meaning in the public health context,” concluded the scientists.

Since certain expressions can have several valid translations, the human evaluation assigned the same score to different translations. The two domain-specific setups were nonetheless comparable in translation quality.

Here are some linguistic examples (English only) mentioned in the paper for comparison.

  • “not pathogenic in their naturally occurring host” (baseline) vs. “nonpathogenic in their natural reservoir hosts” (in-domain). The in-domain translation was deemed more idiomatically correct in the medical context. 
  • “maternity wards” (baseline) vs. “birthing pools” and “birth baths” (in-domain). The baseline was considered a translation error.
  • “serum tests” (baseline) vs. “serological tests” (in-domain). The in-domain translation was considered to be more idiomatically correct.

The scientists also mentioned that in some cases only the baseline and one of the in-domain systems produced an accurate translation. An example of this was the translation into Arabic of “If you do wear a mask,” which was incorrect in two out of the three setups.

What’s Next for Domain-Specific MT?

The scientists concluded that more research is needed on using terminology for domain-specific data generation, and they propose experimenting with this approach for low-resource languages and multilingual settings. More publicly available domain-specific datasets would undoubtedly help validate the approach further.

Citing the work of other researchers, the group also highlighted back-translation as key to their approach, adding that one study showed forward-translation can lead to some quality improvements, but back-translation yields superior results.

The expert-in-the-loop model also continues to prove its value in these research efforts as well as in practical applications, on both the buyer and the language service provider (LSP) side. Without the qualified judgment of a field expert, NLP scientists would be unable to properly assess the results of some domain-specific MT experiments.

More empirical data will certainly benefit all types of users, but the question remains as to whether they will have the resources to conduct their own in-house, domain-specific MT experiments.

The paper on this domain-specific MT experiment is among those chosen for this year’s conference of the Association for Machine Translation in the Americas (AMTA), which takes place from September 12–16, 2022.