Meet MT-Ranker, a New Machine Translation Evaluation System

Automatic evaluation of machine translation (MT) is important for measuring the progress of MT systems with lower subjectivity compared to human evaluation (more on the difference between MT quality evaluation vs. estimation here).

However, traditional approaches, which treat MT evaluation as a regression problem that produces an absolute translation-quality score, face limitations in interpretability, in consistency with human annotator scores, and, in the case of reference-based evaluation, in their reliance on reference translations.

To address these challenges, Ibraheem Muhammad Moosa, Rui Zhang, and Wenpeng Yin from Pennsylvania State University introduced MT-Ranker in a January 30, 2024 paper. MT-Ranker is a system designed to directly predict which translation within a given pair is better, rather than providing an absolute quality score.

As the authors explained, the proposed approach formulates reference-free MT evaluation as a pairwise ranking problem. The pairwise ranking approach has been largely underexplored, with previous applications limited to reference-based evaluation scenarios. “We are the first to model reference-free MT evaluation as a pairwise ranking problem,” they said.
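To make the formulation concrete, here is a minimal, hypothetical sketch of reference-free pairwise ranking in Python. The real MT-Ranker scores candidates with a trained mT5-based encoder; the toy overlap scorer below is only a stand-in so that the shape of the problem (a source plus two candidate translations in, a preference out) is runnable:

```python
# Illustrative sketch only: MT-Ranker uses a trained multilingual encoder,
# not this toy heuristic. The point is the pairwise formulation itself.

def toy_score(source: str, translation: str) -> float:
    """Hypothetical stand-in scorer: crude token-overlap proxy for adequacy."""
    src_tokens = set(source.lower().split())
    hyp_tokens = set(translation.lower().split())
    if not hyp_tokens:
        return 0.0
    return len(src_tokens & hyp_tokens) / len(src_tokens | hyp_tokens)

def rank_pair(source: str, translation_a: str, translation_b: str) -> str:
    """Pairwise formulation: predict which candidate is better, not a score."""
    score_a = toy_score(source, translation_a)
    score_b = toy_score(source, translation_b)
    return "A" if score_a >= score_b else "B"
```

The contrast with regression-based metrics is that no absolute quality number is ever exposed: the system's only output is a relative preference between the two candidates.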

Practical Utility

The authors highlighted that the pairwise ranking approach is sufficient for the most important use case of automatic evaluation metrics: comparing MT systems. Its advantages are manifold:

  • Simplicity, as pairwise ranking is considered more straightforward than regression-based evaluation.
  • Applicability in scenarios without references.
  • Reduced reliance on high-quality manual annotations.

“By eliminating the dependency on human-provided reference translations and comparison data, our system exhibits enhanced practical utility,” they noted. 

Using the encoder of multilingual T5 (mT5) as the backbone, the authors explored three model variants of increasing size: Base (290M parameters), Large (600M), and XXL (5.5B). MT-Ranker was trained without any human annotations, using multilingual natural language inference data and synthetic data (i.e., synthetically generated translation pairs in which one translation can be considered better than the other) through a three-stage process:

  • Pre-training with indirect supervision: This stage provided indirect supervision that taught the model to prefer translations that do not contradict the source sentence.
  • Fine-tuning to discriminate between human translation and machine translation: At this stage, training pairs were constructed based on the assumption that a human-written reference translation is generally better than machine translation.
  • Further fine-tuning on weakly supervised synthetic data: To address potential limitations arising from the reliance on reference translations in the previous stage, the authors conducted further fine-tuning on weakly supervised synthetic data. This step aimed to mitigate biases introduced by the reference-based approach and to cover the translation-quality spectrum more comprehensively.
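The second stage's training signal can be sketched as simple pair construction, assuming only that a human reference outranks the corresponding machine translation. The function and field names below are hypothetical illustrations, not the authors' code:

```python
# Hedged sketch of stage-2 pair construction: no human quality annotations
# are needed, only the assumption that a reference beats an MT output.

def build_stage2_pairs(examples):
    """Each example is a dict with 'source', 'reference', and 'mt_output' keys.
    Returns (source, candidate_1, candidate_2, label) tuples, where label 0
    means the first candidate is preferred."""
    pairs = []
    for ex in examples:
        # The human reference is assumed to be the better translation.
        pairs.append((ex["source"], ex["reference"], ex["mt_output"], 0))
        # Also emit the swapped order so the model cannot learn a position bias.
        pairs.append((ex["source"], ex["mt_output"], ex["reference"], 1))
    return pairs
```

Stage three then replaces this reference-vs-MT assumption with weakly supervised synthetic pairs, precisely because references are not always better than every machine translation.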

SOTA Correlation With Human Judgements

The authors focused on seven X-to-English and English-to-X language pairs: Czech-English, German-English, Japanese-English, Polish-English, Russian-English, Tamil-English, and Chinese-English.

The system was evaluated on benchmark datasets, including the WMT20 Shared Metrics Task, MQM20, MQM21, MQM22, and ACES. The Kendall-like Tau correlation was used to measure agreement between the rankings produced by MT-Ranker and human judgements.
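As a rough illustration, a Kendall-like Tau over pairwise decisions reduces to counting agreements and disagreements with human preferences. This is a simplified sketch of the WMT-style formula, not the exact benchmark implementation:

```python
def kendall_like_tau(human_prefs, metric_prefs):
    """Simplified Kendall-like Tau: (concordant - discordant) / total.
    Inputs are parallel lists of pairwise preferences, e.g. 'A' or 'B',
    one entry per translation pair judged by both humans and the metric."""
    concordant = sum(h == m for h, m in zip(human_prefs, metric_prefs))
    discordant = len(human_prefs) - concordant
    return (concordant - discordant) / (concordant + discordant)
```

A value of 1.0 means the metric agrees with every human pairwise judgement, 0 means it does no better than chance, and -1.0 means it systematically inverts the human preference.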

Comparative analysis against best-performing MT evaluation metrics including COMET-QE, OPENKIWI, and T5-SCORE showcased MT-Ranker’s “state-of-the-art correlation with human judgements” across all benchmark datasets and language pairs. 

The availability of the code on GitHub further promotes transparency and reproducibility in research and development efforts within the MT community.