No Post-Editing Output: Microsoft Releases Massive Machine Translation Test Set

On 24 November 2022, Microsoft released NTREX-128 (News Test References for MT Evaluation of 128 Languages), the second-largest human-translated data set for machine translation (MT) evaluation.

NTREX-128 expands multilingual test coverage to translation from English into 128 target languages and consists of 123 documents (1,997 sentences, roughly 42,000 words).

As Christian Federmann, Tom Kocmi, and Ying Xin wrote in the paper describing NTREX-128, the release adds a new benchmark for massively multilingual MT research. “We release NTREX-128 in the hope that it may be useful for the scientific community,” they said.

Data Scarcity

Test data are essential for assessing the quality of massively multilingual MT models. However, creating test data is expensive, especially when test sets must cover more than 100 languages. As a result, according to the paper’s authors, progress in the field is hampered by the scarcity of available test data.

There are a few multilingual benchmark test sets, such as TICO-19, FLORES-101, and FLORES-200. But “more data will be needed to boost research efforts,” Federmann, Kocmi, and Xin said.

For that reason, they began collecting test data for massively multilingual models and made it available to the research community as another benchmark for evaluating such systems.

Test Data Quality

Given that test data must be of high quality to be useful, the authors specified two requirements: 1) reference translations should be produced by bilingual native speakers of the respective target language; and, interestingly, 2) they should not be created by post-editing MT output.

“Reference-based evaluation metrics, by design, have an inherent problem with reference bias,” Federmann, Kocmi, and Xin explained. If reference translations are produced by post-editing MT output, their quality may be inferior to translations created from scratch; more crucially, they may give the respective MT system an unfair advantage in competitive evaluations.
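
To make that bias concrete, here is a minimal sketch (not from the paper) using the sacreBLEU library, illustrating how reference-based scoring works: the same hypothesis scores very differently depending on which reference it is compared against, so a reference post-edited from a system’s own output will flatter that system. The example sentences are invented for illustration.

```python
# Minimal sketch of reference-based MT evaluation with sacreBLEU.
# The sentences are hypothetical; only the scoring API is real.
from sacrebleu.metrics import BLEU

hypothesis = ["The cabinet approved the budget on Tuesday."]

# Hypothetical references: one translated from scratch, one post-edited
# from MT output (and therefore much closer to the system's phrasing).
ref_from_scratch = ["On Tuesday, ministers signed off on the spending plan."]
ref_post_edited = ["The cabinet approved the budget on Tuesday."]

bleu = BLEU()
print(bleu.corpus_score(hypothesis, [ref_from_scratch]).score)  # low score
print(bleu.corpus_score(hypothesis, [ref_post_edited]).score)   # near 100
```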

To produce the NTREX-128 data set, the original English WMT19 test set was sent to professional translators, who also had access to the full document context. However, the authors cannot be sure whether (or to what extent) the translators actually used this context. In addition, to ensure high-quality test data, the authors propose a quality-filtering method based on human evaluation.

Finally, the authors recommend that NTREX-128 be used to evaluate translation models from English into other languages, but not in the reverse direction, since the directionality of a test set’s translation plays an important role in evaluation.

The NTREX-128 data set is available on GitHub.
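
For readers who want to experiment with the test set, the data are distributed as plain-text files with one sentence per line, aligned across languages. Below is a minimal Python sketch of loading English–German sentence pairs from a local clone; the directory layout and file-name pattern shown here are assumptions based on the repository and should be verified there.

```python
# Minimal sketch, assuming NTREX-128's plain-text layout: one sentence
# per line, aligned line by line across language files. File names are
# assumed from the GitHub repository and should be checked there.
from pathlib import Path

DATA_DIR = Path("NTREX/NTREX-128")  # local clone of the repository

def load_pairs(target_lang: str) -> list[tuple[str, str]]:
    """Return aligned (English source, target reference) sentence pairs."""
    src = (DATA_DIR / "newstest2019-src.eng.txt").read_text("utf-8").splitlines()
    ref = (DATA_DIR / f"newstest2019-ref.{target_lang}.txt").read_text("utf-8").splitlines()
    assert len(src) == len(ref), "source and reference should align line by line"
    return list(zip(src, ref))

pairs = load_pairs("deu")  # German, one of the 128 target languages
print(len(pairs))          # expected: 1,997 sentences
print(pairs[0])
```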