In its ongoing quest to reduce gender bias in machine translation (MT), Google has released a dataset of Translated Wikipedia Biographies. The ultimate goal, according to Google researchers, is to improve machine learning systems focused on gender and pronouns in translation by coming up with a benchmark for accuracy.
“Because they are well-written, geographically diverse, contain multiple sentences, and refer to subjects in the third person (and so contain plenty of pronouns), Wikipedia biographies offer a high potential for common translation errors associated with gender. These often occur when articles refer to a person explicitly in early sentences of a paragraph, but there is no explicit mention of the person in later sentences,” the researchers said in a June 24, 2021 blog post.
They said the Translated Wikipedia Biographies dataset can be used to evaluate gender bias in MT output across common translation errors, of which the researchers singled out three: pro-drop, possessives, and gender agreement.
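Pro-drop, or pronoun dropping, occurs in languages where, as the name implies, pronouns can be left out because they can be inferred from context. Examples include Japanese, Hindi, and Korean. When such text is translated into a language that requires explicit pronouns, such as English, the MT system has to supply one and may pick the wrong gender.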
Possessives in English mark the possessor’s gender in some forms, such as “his” and “her,” but not in others, such as “mine” and “yours.” Compare that to French, for example, where possessives must agree with the grammatical gender of the nouns they modify (e.g., “mon” with masculine nouns, “ma” with feminine nouns), while in English “my” would apply to both.
Gender agreement has to do with nouns and their modifiers agreeing with a person’s gender. In Spanish, for instance, “la médica” would be used for a female doctor and “el médico” for a male doctor, while English makes no such distinction.
The same blog post features a two-sentence example that, if run through Google Translate today, would contain these three errors. In English, it reads: “Marie Curie was born in Warsaw. The distinguished scientist received the Nobel Prize in 1903 and in 1911.”
At this writing, Google Translate renders “The distinguished scientist” as “El distinguido científico” in Spanish and “Der angesehene Wissenschaftler” in German, both masculine forms, even though the phrase refers to Marie Curie.
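To make the gender agreement error concrete, here is a minimal Python sketch of how such a mismatch could be flagged automatically. It is not Google’s evaluation method. The function name `gender_mismatches` and the tiny word lists are hypothetical and hand-picked for this one example, and the check assumes that every gendered word in the passage refers to the biography’s subject, which will not hold for real text.

```python
import re

# Hypothetical, hand-picked marker words for illustration only.
# They are not taken from the dataset or from Google's evaluation.
MASCULINE_MARKERS = {
    "es": {"distinguido", "científico", "médico"},
    "de": {"wissenschaftler"},
}
FEMININE_MARKERS = {
    "es": {"distinguida", "científica", "médica"},
    "de": {"wissenschaftlerin"},
}


def gender_mismatches(translation, lang, subject_gender):
    """Return marker words in `translation` that conflict with `subject_gender`.

    Assumes every gendered marker in the text refers to the subject.
    """
    tokens = re.findall(r"\w+", translation.lower())
    if subject_gender == "female":
        conflicting = MASCULINE_MARKERS[lang]
    else:
        conflicting = FEMININE_MARKERS[lang]
    return [token for token in tokens if token in conflicting]


# The two renderings quoted above, checked against a female subject.
print(gender_mismatches("El distinguido científico", "es", "female"))
print(gender_mismatches("Der angesehene Wissenschaftler", "de", "female"))
```

A word-list check like this only catches forms it already knows about. Pro-drop and possessive errors, which depend on who a sentence refers to, would need reference translations or human review.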
According to the Google research team, they “extracted biographies from Wikipedia according to occupation, profession, job and/or activity” to build a set that represents genders and geographies equally. Thus, the dataset comprises entries about people from over 90 countries covering all the world’s regions.
Google said that while the newly released dataset enables a new way to evaluate the gender bias reduction in MT that it introduced in April 2020, it “doesn’t aim to cover the whole problem.”
Rather than being prescriptive about the optimal approach to fixing gender bias, the Google team said they merely aim “to foster progress on this challenge across the global research community.”