Overcoming ‘Male-As-Norm’ Behavior in Machine Translation

Despite significant improvements in machine translation (MT) quality, gender bias remains a concern. Automated translation systems often get gender wrong due to biases in their training data.

In a paper published on November 27, 2023, Microsoft researchers Ranjita Naik, Spencer Rarrick, and Vishal Chowdhary highlighted that these sophisticated systems tend to absorb, and sometimes amplify, the societal biases inherent in their training data.

In another paper, published on November 30, 2023, researchers Matúš Pikuliak, Andrea Hrckova, Stefan Oresko, and Marián Šimko from the Kempelen Institute of Intelligent Technologies emphasized that these biases not only pose potential issues for individual users but also influence downstream systems that use these translations. They explained that “an AI system trained with data translated with a biased MT system might learn these MT-injected biases, even if they did not exist in the source data.”

The authors attribute these biases to gender stereotypes, underscoring the importance of defining and understanding them. Pikuliak, Hrckova, Oresko, and Šimko highlighted that a multitude of gender stereotypes exist worldwide and vary across cultures, and pointed out that much prior work overlooks this. “Many previous works do not consider this and they work with the concept of stereotype as if it were a singular entity,” they said.

To address this, the Kempelen team employed a more fine-grained approach to study “which specific stereotypes were learned by the models and how strong the stereotypes are,” and they released GEST, a new dataset for measuring gender-stereotypical reasoning in English-to-X MT systems. 
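The core idea behind a GEST-style probe can be sketched roughly as follows: translate English first-person sentences whose gender is unmarked in the source but must be gendered in the target language, detect which gender the MT system chose, and tally the choices per stereotype. The sample sentences, the translate_to_target stub, and the detect_gender heuristic below are illustrative assumptions, not the authors’ released code; in practice one would call the MT system’s API and use proper target-language morphological analysis.

```python
from collections import Counter

# Illustrative GEST-style probe (a sketch, not the authors' implementation).
# Each English sentence is first-person, gender-ambiguous in English, and
# tagged with the stereotype it expresses; the translation must pick a gender.
samples = [
    {"text": "I am very beautiful.", "stereotype": "beauty"},
    {"text": "I am a natural leader.", "stereotype": "leadership"},
]

def translate_to_target(text: str) -> str:
    # Stand-in for a call to the MT system under test (e.g. a cloud API).
    # The returned strings are toy placeholders for a gendered target language.
    toy_outputs = {
        "I am very beautiful.": "Som veľmi krásna.",    # feminine form
        "I am a natural leader.": "Som rodený vodca.",  # masculine form
    }
    return toy_outputs[text]

def detect_gender(translation: str) -> str:
    # Toy heuristic; a real probe would rely on target-language morphology.
    feminine_markers = ("krásna", "rodená")
    return "feminine" if any(m in translation for m in feminine_markers) else "masculine"

# Tally how often each stereotype is rendered with masculine vs. feminine forms.
tallies: dict[str, Counter] = {}
for sample in samples:
    gender = detect_gender(translate_to_target(sample["text"]))
    tallies.setdefault(sample["stereotype"], Counter())[gender] += 1

for stereotype, tally in tallies.items():
    total = sum(tally.values())
    print(f"{stereotype}: {tally['masculine'] / total:.0%} masculine")
```

Aggregating such counts across stereotypes and systems is what allows statements like “the most masculine system” to be made quantitatively.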

Strong Male-As-Norm Behavior

Using GEST, they evaluated Amazon Translate, DeepL, Google Translate, and NLLB200, revealing strong “male-as-norm” behavior, with Amazon Translate identified as “the most masculine system”, followed by Google Translate.

They also observed similar tendencies for gender-stereotypical reasoning across these systems, suggesting they might have learned from “very similar poisoned sources.” According to the authors, these systems “think” that women are beautiful, neat, and diligent, while men are leaders, professional, rough, and tough.

Having a better understanding of the MT systems’ behavior, the authors recommended a focused approach to address specific issues, such as preventing models from sexualizing women. “This might be more manageable compared to when gender bias is taken as one vast and nebulous problem,” they said.

Mitigating Gender Bias

The Microsoft researchers looked at measuring and mitigating gender bias in MT systems. They emphasized that gender bias in MT goes beyond sentences with ambiguous gender, extending to instances where gender can be inferred from the context, yet the MT output contradicts the gender information present in the source.

To address this, they proposed fine-tuning a base model using a gender-balanced in-domain dataset derived from the training corpus and introduced a novel domain-adaptation technique, leveraging counterfactual data generation methods. 

The process involved selecting gendered sentences from the base model’s training corpus and generating counterfactuals by creating gender-swapped versions, with a specific focus on sentences containing masculine or feminine forms of animate profession nouns.
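To make the counterfactual step concrete, here is a minimal sketch of gender-swapping on the English source side, assuming a hand-built swap table of pronouns and profession nouns. The table entries and the swap_gender helper are illustrative assumptions, not the paper’s actual word lists or pipeline, which also handles the target side and morphology.

```python
import re

# Illustrative swap table: masculine <-> feminine forms of pronouns and animate
# profession nouns (entries are assumptions for this sketch).
SWAPS = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",  # note: "her" is ambiguous (object vs. possessive);
                                 # a real pipeline would disambiguate with POS tags
    "actor": "actress", "actress": "actor",
    "waiter": "waitress", "waitress": "waiter",
    "chairman": "chairwoman", "chairwoman": "chairman",
}

def swap_gender(sentence: str) -> str:
    """Produce a gender-swapped counterfactual of an English source sentence."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        swapped = SWAPS.get(word.lower(), word)
        # Preserve the capitalization of the original token.
        return swapped.capitalize() if word[0].isupper() else swapped

    pattern = r"\b(" + "|".join(map(re.escape, SWAPS)) + r")\b"
    return re.sub(pattern, replace, sentence, flags=re.IGNORECASE)

# A gendered source sentence and its counterfactual; both (with matching
# target-side swaps) would feed the gender-balanced fine-tuning set.
original = "The actor said he would sign the contract."
print(swap_gender(original))
# -> "The actress said she would sign the contract."
```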

The authors highlighted the advantages of their approach, noting that it relies on a subset of the in-domain training corpus to generate fine-tuning data, which avoids the catastrophic forgetting otherwise seen during domain adaptation. They stressed its purely data-centric nature, requiring no modifications to training objectives or additional decoding models, and noted that counterfactual data generation provides a dynamic and diverse dataset during model training.

Accuracy Improvements

The evaluation of their approach, conducted using the WinoMT test set tailored for profession words, demonstrated significant accuracy improvements for Italian, Spanish, and French.

“We achieve 19%, 23%, and 21.6% […] accuracy improvements over the baseline for Italian, Spanish, and French respectively, without significant loss in general translation quality,” they said.
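As a rough illustration of what this accuracy measures (a sketch over invented data, not the authors’ evaluation code): WinoMT-style scoring checks whether the gender of the translated profession word matches the gender implied by the English source, and accuracy is simply the share of matches.

```python
# Toy WinoMT-style scoring: each item records the gender implied by the English
# source and the gender detected for the profession word in the MT output.
# The data values below are invented for illustration only.
items = [
    {"source_gender": "female", "translated_gender": "female"},
    {"source_gender": "female", "translated_gender": "male"},
    {"source_gender": "male",   "translated_gender": "male"},
    {"source_gender": "male",   "translated_gender": "male"},
]

correct = sum(i["source_gender"] == i["translated_gender"] for i in items)
accuracy = correct / len(items)
print(f"gender accuracy: {accuracy:.0%}")  # -> 75% on this toy sample
```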

The authors concluded by highlighting potential directions for future work, including extending techniques to address non-binary gender and improving the handling of complex sentences involving multiple individuals, “where different entities get gender-swapped in the source and target.”