Amazon released a new machine translation (MT) gender evaluation benchmark called MT-GenEval in December 2022.
According to researchers from Amazon — Anna Currey, Maria Nadejde, Raghavendra Pappagari, Mia Mayer, Stanislas Lauly, Xing Niu, Benjamin Hsu, and Georgiana Dinu — the goal is to better understand how MT systems perform on the task of gender translation accuracy.
As Currey and Hsu explained in a December 8, 2022, blog post, MT systems are prone to gender-biased translation errors and “sometimes incorrectly translate the genders of people referred to in input segments, even when an individual’s gender is unambiguous based on the linguistic context.”
Moreover, “existing gender evaluation benchmarks have limited diversity in terms of gender phenomena (e.g., focusing on professions), sentence structure (e.g., using templates to construct sentences), or language coverage” — making it even more challenging to assess how MT systems perform in terms of both gender and quality at the same time, according to the research paper describing the benchmark.
The paper was presented at the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Realistic Dataset
MT-GenEval is a large, realistic evaluation set that contains translations from English into eight diverse and widely spoken languages: Arabic, French, German, Hindi, Italian, Portuguese, Russian, and Spanish.
Unlike commonly used gender-bias test sets, which are artificially constructed, the MT-GenEval dataset is based on real-world data obtained from Wikipedia and includes professionally created reference translations in each of the languages.
Furthermore, it is fully balanced by including human-created gender counterfactuals. “This type of balancing ensures that differently gendered subsets do not have different meanings,” the Amazon researchers explained in the same blog post.
Apart from the 1,150 segments of evaluation data per language pair, the researchers also released 2,400 parallel sentences for training and development.
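To make the counterfactual balancing concrete, here is a minimal, self-contained sketch of what such paired segments could look like and how they might be read. The column names and tab-separated layout are illustrative assumptions, not the dataset’s actual schema; the released files should be consulted for the real format.

```python
import csv
import io

# Illustrative stand-in for one row of MT-GenEval-style data: each English
# segment comes with a human-written counterfactual of the other gender,
# plus professionally created references for both. Column names are
# assumptions for illustration only.
SAMPLE_TSV = (
    "source_masculine\tsource_feminine\treference_masculine\treference_feminine\n"
    "He is a teacher in Berlin.\tShe is a teacher in Berlin.\t"
    "Er ist Lehrer in Berlin.\tSie ist Lehrerin in Berlin.\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t")
for row in reader:
    # Because the two variants differ only in gender, any accuracy gap
    # between the masculine and feminine subsets reflects gender handling,
    # not differences in content or difficulty.
    print(row["source_masculine"], "->", row["reference_masculine"])
    print(row["source_feminine"], "->", row["reference_feminine"])
```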
Gender Translation Accuracy
The researchers define gender accuracy in translation as “the extent to which a machine translation output accurately reflects the gender of the humans mentioned in the input, restricted to cases where the gender is explicitly and linguistically disambiguated in the context of the input.”
Therefore, the benchmark does not take into account the grammatical gender of inanimate objects or instances in which the input gender is ambiguous within the given context.
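One way to operationalize this definition with MT-GenEval’s counterfactual references is to flag an output that uses gender-marking words belonging only to the opposite-gender reference. The sketch below is a simplified illustration under that assumption; the function, tokenization, and word lists are hypothetical, not the benchmark’s actual implementation.

```python
def gender_accuracy(outputs, wrong_gender_words):
    """Fraction of MT outputs that avoid all wrong-gender marking words.

    outputs            -- list of MT output strings
    wrong_gender_words -- per-segment sets of words that appear only in the
                          opposite-gender reference (hypothetical annotation)
    """
    correct = 0
    for output, bad_words in zip(outputs, wrong_gender_words):
        tokens = set(output.lower().split())
        if tokens.isdisjoint(w.lower() for w in bad_words):
            correct += 1
    return correct / len(outputs)

# Example: German outputs for two inputs whose context marks the person as
# female; the second output wrongly uses the masculine forms "bekannter Lehrer".
outputs = [
    "Sie ist eine bekannte Lehrerin .",
    "Sie ist ein bekannter Lehrer .",
]
wrong = [{"Lehrer", "bekannter"}, {"Lehrer", "bekannter"}]
print(gender_accuracy(outputs, wrong))  # 0.5
```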
The benchmark has proven to be useful for evaluating both commercial and research systems, including contextual machine translation models and gender-balanced models, in terms of gender accuracy as well as quality.
“MT-GenEval is a step forward for the evaluation of gender accuracy in machine translation,” the Amazon researchers said. “We hope that this benchmark and development data will spur more research in the field of gender accuracy in translation on diverse languages,” they concluded.