Neural machine translation (NMT) output is only as good as the quality of its training data (i.e., garbage in, garbage out). And it is not only tangible errors in training data that create problems. Social biases contained in training data can also seep into machine translation output. (Sadly, great data in does not always guarantee great results out, though it certainly helps.)
A new research paper published in April 2020 entitled “Reducing Gender Bias in Neural Machine Translation as a Domain Adaptation Problem,” co-authored by Danielle Saunders and Bill Byrne, seeks to minimize the impact of this specific social bias.
According to her LinkedIn profile, Saunders is a PhD student focusing on statistical and neural machine translation using the Tensorflow framework. She also works part-time as a Research Scientist at language service provider SDL. Byrne is Professor of Information Engineering at Cambridge and Course Director for the University’s MPhil in Machine Learning, Speech, and Language Technology. He was Director of SDL’s UK R&D office until December 2019 and now works part-time for Amazon on Alexa Search.
The natural language processing (NLP) events calendar that normally spurs a flood of new MT research papers is carrying on despite the lockdown. Research publication activity is now high in the run-up to the 2020 Annual Conference of the Association for Computational Linguistics (ACL), which will take place online in July 2020.
Gender bias is an evolving area of MT research, and one that is being explored across many disciplines of NLP.
The Gender Bias Problem
As the two researchers from the University of Cambridge, UK, point out, training data tends to contain fewer sentences that refer to women than to men. In short, it is gender-biased. This is problematic in NMT because “gender bias has been shown to reduce translation quality, particularly when the target language has grammatical gender.” In fact, it may even amplify biases.
Not only that, but in gender-inflected languages, gender-biased training data can even lead to “translations with identifiable errors,” the paper reads. For example, they say, mentions of male doctors are more reliably translated than those of male nurses.
Google Translate ran into a similar problem back in 2018. Phrases that included words such as “strong” or “doctor” would generally contain masculine pronouns when translated, while instances of “beautiful” and “nurse” would result in translated phrases with feminine pronouns.
The issue prompted Google to update its translation framework so that translations into gender-inflected languages (e.g., French and Spanish) would contain both masculine and feminine variations of the phrase.
For longer phrases, it was more complicated to resolve, and Google made significant changes to its framework. Although Google then claimed that its new system could “reliably produce feminine and masculine translations 99% of the time,” researchers continued to identify a number of shortcomings.
Fine-Tuning Rather Than Training
One way to reduce gender-bias in NMT output is to cut the problem off at the root: remove gender bias in the training data. However, this is too big an undertaking in many instances.
Instead, Saunders and Byrne approached gender debiasing as a domain adaptation problem; that is, by attempting to filter out gender bias in the output through fine-tuning rather than retraining from scratch.
Their intention was to use a form of fine-tuning called ‘transfer learning’ on a small dataset that contained only unbiased sentences. They believed they would see “strong and consistent improvements in gender debiasing with much less computational cost than training from scratch.” In addition, this approach preserves data privacy because the original training data itself does not need to be accessed or touched.
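To make the idea concrete, here is a minimal sketch of what such fine-tuning can look like in code, assuming a PyTorch setup; the toy model, data, and loss below are placeholders for illustration, not the authors’ actual system.

```python
import torch
from torch import nn, optim

# Stand-in for a pretrained NMT model (in reality a Transformer trained on
# the full, biased parallel corpus).
pretrained_model = nn.Linear(16, 16)

# Stand-in for the tiny gender-balanced adaptation set: (source, target) pairs.
adaptation_set = [(torch.randn(4, 16), torch.randn(4, 16)) for _ in range(10)]

def fine_tune(model, data, epochs=3, lr=1e-4):
    """Continue training an already-converged model on a small in-domain set."""
    optimizer = optim.SGD(model.parameters(), lr=lr)  # low learning rate, few steps
    loss_fn = nn.MSELoss()                            # stand-in for the NMT loss
    for _ in range(epochs):
        for source, target in data:
            optimizer.zero_grad()
            loss = loss_fn(model(source), target)
            loss.backward()
            optimizer.step()
    return model

fine_tune(pretrained_model, adaptation_set)
```

The key point is that only the small balanced set is touched during adaptation; the original parallel corpus never needs to be revisited.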
Using their debiasing model, the researchers also hoped to demonstrate that it was possible to remove gender bias in the output of a number of commercial MT systems: Google, Amazon, Microsoft, and SYSTRAN.
They first created a “tiny, handcrafted profession-based dataset” that would be used for fine-tuning. This dataset contained gender-balanced English sentences, which were later translated into three target languages: German, Spanish, and Hebrew.
Each English sentence contained professions sourced from US labor statistics and was structured as follows: The [PROFESSION] finished [his|her] work.
There were 194 professions and 388 English sentences in total.
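Reproducing a dataset of this shape is straightforward. The short Python sketch below builds the gender-balanced English sentences from the template quoted above; the profession list shown is a small illustrative sample, not the 194-item list drawn from US labor statistics.

```python
# Template from the paper: "The [PROFESSION] finished [his|her] work."
PROFESSIONS = ["doctor", "nurse", "engineer", "cleaner", "accountant"]
TEMPLATE = "The {profession} finished {pronoun} work."

def build_balanced_set(professions):
    """One masculine and one feminine sentence per profession."""
    sentences = []
    for profession in professions:
        for pronoun in ("his", "her"):
            sentences.append(TEMPLATE.format(profession=profession, pronoun=pronoun))
    return sentences

sentences = build_balanced_set(PROFESSIONS)
# With the full 194-profession list this yields the paper's 388 sentences (194 x 2).
print(len(sentences), sentences[:2])
```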
For contrast and to compare the results, the researchers created an approximated counterfactual dataset, in which, for every sentence containing a gendered term, a bias-reversed equivalent was added.
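Counterfactual augmentation of this kind can be approximated by swapping gendered terms, roughly as in the sketch below; the swap table and helper function are illustrative and far cruder than the procedure used in the paper.

```python
# Rough approximation of counterfactual data augmentation: for every sentence
# containing a gendered term, add a gender-flipped copy. The swap table is
# deliberately tiny and ignores real ambiguities (e.g. "her" as object vs.
# possessive), so treat it as a sketch only.
SWAPS = {
    "he": "she", "she": "he",
    "his": "her", "her": "his",
    "man": "woman", "woman": "man",
}

def flip_gender(sentence):
    tokens = sentence.lower().rstrip(".").split()
    return " ".join(SWAPS.get(tok, tok) for tok in tokens) + "."

def augment(corpus):
    augmented = list(corpus)
    for sentence in corpus:
        flipped = flip_gender(sentence)
        if flipped != sentence.lower():  # add a copy only if a gendered term was found
            augmented.append(flipped)
    return augmented

print(augment(["The doctor finished his work.", "The data was clean."]))
```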
They planned to use the tiny unbiased datasets to remove bias in NMT output through transfer learning. Transfer learning, however, is prone to the phenomenon of “catastrophic forgetting,” which negatively affects translation quality.
To minimize the effects of catastrophic forgetting while preserving gender balance, Saunders and Byrne drew on two further techniques: a regularized training procedure known as ‘Elastic Weight Consolidation’ (EWC), and a two-step lattice rescoring procedure.
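In its standard formulation (Kirkpatrick et al., 2017), EWC adds a quadratic penalty to the fine-tuning loss that anchors each parameter to its original value, in proportion to how important that parameter was to the original translation task. A minimal sketch of that penalty, which would be added to the loss inside a fine-tuning loop like the one sketched earlier, might look as follows; the function and the numpy formulation are illustrative, not the paper’s implementation.

```python
import numpy as np

def ewc_loss(task_loss, params, old_params, fisher_diag, lam=0.1):
    """Standard EWC objective: loss on the small debiasing set plus a quadratic
    penalty keeping parameters close to their original values, weighted by a
    per-parameter importance estimate (diagonal Fisher information computed
    on the original training data)."""
    penalty = 0.5 * lam * np.sum(fisher_diag * (params - old_params) ** 2)
    return task_loss + penalty

# Toy usage: three parameters, the second deemed most important to the
# original translation task, so it is penalized most for drifting.
theta_old = np.array([0.5, -1.2, 0.3])
theta_new = np.array([0.6, -0.8, 0.3])
fisher = np.array([0.1, 5.0, 0.2])
print(ewc_loss(task_loss=2.0, params=theta_new, old_params=theta_old, fisher_diag=fisher))
```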
The researchers used different training data for their experiments for each of the three language pairs: English-German, English-Spanish, and English-Hebrew. However, “all three datasets have about the same proportion of gendered sentences: 11–12% of the overall set,” they said.
With fine-tuning, the researchers showed that both EWC and lattice rescoring “allow debiasing while maintaining general translation performance.” Lattice rescoring, they said, “although a two-step procedure, allows far more debiasing and potentially no degradation, without requiring access to the original model.”
Saunders and Byrne also showed that lattice rescoring “can be applied to remove gender bias in the output of ‘blackbox’ online commercial MT systems.”
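Conceptually, the two steps are: expand the gender-inflected words of an existing translation into their masculine and feminine alternatives (the lattice), then let the debiased model pick the highest-scoring path. The sketch below is purely illustrative; the alternatives table and the dummy scoring function stand in for a real morphological lattice and the adapted model’s log-probabilities.

```python
from itertools import product

# Hypothetical gender alternatives for tokens in a German translation.
ALTERNATIVES = {
    "Der": ["Der", "Die"],
    "Arzt": ["Arzt", "Ärztin"],
}

def expand_variants(tokens):
    """Step 1: enumerate every path through the (tiny) lattice of gendered forms."""
    options = [ALTERNATIVES.get(tok, [tok]) for tok in tokens]
    return [" ".join(path) for path in product(*options)]

def rescore(variants, score_fn):
    """Step 2: pick the variant the debiased model scores highest."""
    return max(variants, key=score_fn)

# Hypothetical black-box output for "The doctor finished her work", showing an
# identifiable inconsistency (masculine "Der Arzt" alongside feminine "ihre").
baseline = "Der Arzt beendete ihre Arbeit".split()
variants = expand_variants(baseline)

# In practice score_fn would be the log-probability under the gender-debiased
# model; here a dummy that simply rewards the feminine forms.
best = rescore(variants, score_fn=lambda s: s.count("Die ") + s.count("Ärztin"))
print(best)  # "Die Ärztin beendete ihre Arbeit"
```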
The researchers do not claim to have found the fix for gender bias in NMT and point out that the paper only explores the issue at sentence-level. They do, however, suggest that this small-domain adaptation is “a more effective and efficient approach to debiasing machine translation than counterfactual data augmentation.”