NTREX-128 expands multilingual testing for translation from English into 128 target languages and consists of 123 documents (1,997 sentences, 42k words).
As Christian Federmann, Tom Kocmi and Ying Xin wrote in the paper describing NTREX-128, the test set release adds a new benchmark for the evaluation of massively multilingual MT research. “We release NTREX-128 in the hope that it may be useful for the scientific community,” they said.
Test data are important for assessing the quality of massively multilingual MT models. However, creating test data is expensive — especially when taking into account test sets for more than 100 different languages. As a result, progress in the field is hampered by the small amount of test data that is available, according to the authors of the above-mentioned paper.
For that reason, they began collecting test data for massively multilingual models and released it to the research community as an additional evaluation benchmark.
Test Data Quality
Given that test data must be of a high quality to be useful, the authors specified two requirements: 1) reference translations should be done by bilingual native speakers of the respective target language; and, interestingly, 2) they shouldn’t be produced based on post-editing MT output.
“We’re thrilled to announce our release of the second largest human-translated parallel testset, featuring 128 languages each having 2000 sentences translated with a document context without post-editing. Give it a try at https://t.co/ej66qSfBXo @cfedermann and Ying” — Tom Kocmi (@KocmiTom), January 16, 2023
“Reference-based evaluation metrics, by design, have an inherent problem with reference bias,” Federmann, Kocmi, and Xin explained. If reference translations were produced by post-editing MT output, their quality might be inferior to that of from-scratch translation; most crucially, they might give the MT system whose output was post-edited an unfair advantage in competitive evaluations.
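The effect can be illustrated with a toy reference-based metric — F1 over character bigrams, a simplified stand-in for metrics like chrF, not anything the paper itself defines. The sentences below are invented: a reference post-edited from the MT output stays close to it and rewards it, while an independent from-scratch reference does not.

```python
from collections import Counter

def char_ngram_f1(hyp: str, ref: str, n: int = 2) -> float:
    """Toy reference-based metric: F1 over character n-grams (spaces removed)."""
    def ngrams(text: str) -> Counter:
        text = text.replace(" ", "")
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))
    h, r = ngrams(hyp), ngrams(ref)
    if not h or not r:
        return 0.0
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

# Invented example: the same MT output scored against two references.
mt_output = "The minister said the plan will be approved next week."
post_edited_ref = "The minister said that the plan will be approved next week."   # post-edited from the MT output
from_scratch_ref = "According to the minister, approval of the plan is expected next week."

score_pe = char_ngram_f1(mt_output, post_edited_ref)
score_fs = char_ngram_f1(mt_output, from_scratch_ref)
```

Scored against the post-edited reference, the MT output looks far better than against the independent translation — the bias the authors' second requirement is designed to avoid.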
To produce the NTREX-128 data set, the original English WMT19 test set was sent to professional translators, who also had access to the full document context. However, the authors cannot be sure whether (or to what extent) the translators used this information. To further ensure high-quality test data, the authors also propose a quality filtering method based on human evaluation.
Finally, the authors recommend that NTREX-128 be used to evaluate translation out of English, but not in the reverse direction: because the data was translated from English, using it to evaluate into-English translation would mean translating from translated (rather than original) source text, demonstrating that the directionality of test set translation plays an important role.
The NTREX-128 data set is available on GitHub.
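For readers who want to work with the released files, the following is a minimal loading sketch. It assumes one UTF-8 plain-text file per language, one sentence per line, aligned across languages by line number; the `newstest2019-ref.<code>.txt` naming scheme is an assumption for illustration — check the GitHub repository for the actual file names.

```python
from pathlib import Path

def load_pairs(data_dir: str, src_code: str, tgt_code: str) -> list[tuple[str, str]]:
    """Read two line-aligned, one-sentence-per-line reference files into sentence pairs.

    The file-naming scheme below is hypothetical; consult the NTREX
    repository on GitHub for the real layout.
    """
    src_path = Path(data_dir) / f"newstest2019-ref.{src_code}.txt"
    tgt_path = Path(data_dir) / f"newstest2019-ref.{tgt_code}.txt"
    src = src_path.read_text(encoding="utf-8").splitlines()
    tgt = tgt_path.read_text(encoding="utf-8").splitlines()
    if len(src) != len(tgt):
        raise ValueError("reference files must be line-aligned")
    return list(zip(src, tgt))
```

Keeping the files line-aligned is what makes per-sentence, reference-based evaluation across all 128 languages straightforward.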