How to Improve Machine Translation Quality Evaluation with Domain-Specific Data

Predicting the quality of machine translation (MT) output is critical in translation workflows, and quality estimation (QE) models are essential tools for achieving this goal. However, the performance of QE models depends on the availability and quality of training data, which is often scarce due to the high cost and effort associated with labeling such data. Moreover, these models often struggle with data from different domains, both generic and specific, making it difficult to achieve satisfactory results.

To address these challenges, Javad Pourmostafa Roshan Sharami, Dimitar Shterionov, Frédéric Blain, Eva Vanmassenhove, Mirella De Sisto, Chris Emmery, and Pieter Spronck from the Department of Cognitive Science and Artificial Intelligence at Tilburg University have developed a new methodology to boost the performance of QE models. Specifically, they proposed in a 2023 research paper a method that uses a small amount of domain-specific data to improve overall QE prediction performance.

As the authors explained, “this approach is inspired by work on domain adaptation in the field of MT, where a large generic model is initially trained and then fine-tuned with domain-specific data,” and aims to enhance the performance of QE models by improving their generalizability across diverse domains. 

The Right Balance for Generalizability

The authors emphasized that training QE models requires a dataset consisting of post-edited text, the source, and machine-translated text. However, such datasets remain scarce across language pairs and have limited coverage across domains. In fact, “there is less publicly available data for training QE systems as compared to MT systems,” the authors told Slator. This can pose a challenge for all QE models, especially recent ones that utilize large language models (LLMs), since fine-tuning pre-trained models on small datasets has been shown to be quite unstable.

Furthermore, QE models trained on specific data do not generalize well to other domains outside of the training domain, leading to significant decreases in their performance. “To improve the generalizability of QE models, it is important to establish the right balance between domain-specific and generic training data,” explained the authors.

QE Model Training 

The proposed method involves training a generic QE model first and then fine-tuning it on a specific domain. As the authors explained to Slator, the benefits of this process are twofold: first, because generic data is usually available in large volumes, the initial training can produce a robust model; and second, fine-tuning that model on a much smaller dataset adapts it to a specific domain, or even style, while retaining the original robustness.

More specifically, in the first step, a generic QE model is trained using out-of-domain data until it converges. This step leverages the LLM’s cross-lingual transfer capabilities and builds a generic QE model. “This way, we ensure that the model can estimate the quality of a broad range of systems, but with limited accuracy on in-domain data,” said the authors.

In the next step, the model’s parameters are fine-tuned on a mix of out-of-domain and in-domain data, with the smaller in-domain set oversampled so that the model assigns equal attention to both datasets. The objective is to ensure the model does not forget the generic-domain knowledge acquired during the first step while simultaneously improving its ability to perform QE on the domain-specific data. The in-domain data used in this step was both authentic and synthetic (i.e., data augmentation for domain adaptation).
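The oversampled mixing described above can be illustrated with a small, self-contained Python sketch (not the authors' code; the helper name and toy data are hypothetical):

```python
import random

def mix_with_oversampling(out_of_domain, in_domain, seed=0):
    """Repeat (oversample) the small in-domain set until it matches the
    out-of-domain set in size, then shuffle the two together so the
    model sees both domains with equal frequency during fine-tuning."""
    rng = random.Random(seed)
    reps = len(out_of_domain) // len(in_domain)
    remainder = len(out_of_domain) % len(in_domain)
    oversampled = in_domain * reps + rng.sample(in_domain, remainder)
    mixed = out_of_domain + oversampled
    rng.shuffle(mixed)
    return mixed

# Toy example: 6 generic sentence pairs, 2 in-domain pairs
generic = [f"generic-{i}" for i in range(6)]
in_domain = ["med-0", "med-1"]
mixed = mix_with_oversampling(generic, in_domain)
print(len(mixed))            # 12 = 6 generic + 6 oversampled in-domain
print(mixed.count("med-0"))  # 3
```

After mixing, each in-domain example appears roughly as often as each generic example, which is what lets both datasets receive equal attention during fine-tuning.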

Finally, the QE model is trained on a specific in-domain dataset until convergence, resulting in a more domain-specific QE model than that obtained in the previous step.
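Putting the three steps together, the overall schedule can be sketched as follows (a minimal illustration; `fine_tune` is a hypothetical stand-in for a full training loop and is not code from the paper):

```python
def fine_tune(model, data, stage):
    # Stand-in for gradient-based training until convergence;
    # here it only records which data each stage consumed.
    model.setdefault("history", []).append((stage, len(data)))
    return model

def train_qe_pipeline(generic, in_domain):
    model = {}  # hypothetical model state
    # Stage 1: build a generic QE model from out-of-domain data
    model = fine_tune(model, generic, "generic")
    # Stage 2: fine-tune on an oversampled mix
    # (in-domain repeated to match the generic set's size)
    reps = -(-len(generic) // len(in_domain))  # ceiling division
    mixed = generic + (in_domain * reps)[:len(generic)]
    model = fine_tune(model, mixed, "mixed")
    # Stage 3: final fine-tuning on in-domain data only
    model = fine_tune(model, in_domain, "in_domain")
    return model

model = train_qe_pipeline([f"g{i}" for i in range(100)], ["d1", "d2", "d3"])
print(model["history"])  # [('generic', 100), ('mixed', 200), ('in_domain', 3)]
```

The ordering matters: the model moves from broad coverage to domain specificity, with the mixed stage acting as a bridge that guards against forgetting the generic knowledge.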

Beneficial Impact

The authors experimented with publicly available language pairs (English-German, English-Chinese, English-Italian, English-Czech, English-Japanese, Romanian-English and Russian-English) and observed “significant improvements” across all language pairs. This indicates that the proposed solution “has a beneficial impact” in addressing the challenges of data scarcity and domain mismatch, said the authors.

According to the authors, this proposed methodology is the first QE approach that employs domain adaptation and domain augmentation. “To the best of our knowledge, domain adaptation has not been widely used in QE,” but “this is understandable since obtaining a generic model for domain adaptation, which improves the models’ generalizability, can be costly,” they explained to Slator.

Additionally, the method is highly reusable and adaptable, as it can be applied to different QE tasks by simply fine-tuning a chosen generic model on newly collected QE data. The researchers utilized XLM-RoBERTa in their experiments, but any preferred LLM can be used as long as it meets the input-output criteria. They built the tool on Hugging Face implementations of LLMs, so the underlying model can easily be replaced or adapted.
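To make the input-output criteria concrete: a QE model of this kind receives the source sentence and its MT output packed into one sequence and regresses a single quality score. The sketch below mimics RoBERTa-style sentence-pair encoding in plain Python (illustrative only; in practice the Hugging Face tokenizer builds this automatically, e.g. `tokenizer(source, mt_output)`):

```python
def format_qe_input(source, mt_output):
    """Pack a (source, MT output) pair into a single sequence using
    RoBERTa-style special tokens; a regression head on top of the
    encoder then predicts one quality score for the pair."""
    return f"<s> {source} </s></s> {mt_output} </s>"

example = format_qe_input("The patient shows symptoms.",
                          "Der Patient zeigt Symptome.")
print(example)
```

Because the pair is reduced to a standard encoder input, swapping XLM-RoBERTa for another Hugging Face model mostly means changing the tokenizer and checkpoint name, which is what makes the approach easy to adapt.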