Focusing on BLEU Can Bias Machine Translation Output


A recent paper by top machine translation (MT) researchers concluded that beam search, a very effective way to maximize BLEU scores, can lead to a high rate of misgendered pronouns.

The November 2020 paper, Decoding and Diversity in Machine Translation, is a collaboration between Graham Neubig, Nicholas Roberts, and Zachary C. Lipton at Carnegie Mellon University and Amazon machine learning scientist Davis Liang.

The authors opened by describing the two basic stages of the MT process. In the first, the “modeling” stage, researchers train a conditional language model using neural networks; in the second, the “search” stage, the model searches for the “best” translation using either “greedy decoding” or a beam search to produce predictions.
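
The difference between the two search strategies can be seen in a minimal sketch. It assumes a hypothetical toy model (`TOY_MODEL`, with made-up tokens and probabilities) in place of a trained neural network, so it illustrates the general idea rather than the setup used in the paper.

```python
import math

EOS = "</s>"

# Hypothetical toy "model": maps a prefix (tuple of tokens) to
# next-token probabilities. All values are invented for illustration.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}

def next_token_probs(prefix):
    # Any prefix not listed above ends the sentence with probability 1.
    return TOY_MODEL.get(tuple(prefix), {EOS: 1.0})

def greedy_decode(max_len=5):
    """Pick the single most probable token at every step."""
    seq, logp = [], 0.0
    for _ in range(max_len):
        probs = next_token_probs(seq)
        token = max(probs, key=probs.get)
        logp += math.log(probs[token])
        if token == EOS:
            break
        seq.append(token)
    return seq, logp

def beam_search(beam_size=2, max_len=5):
    """Keep the beam_size best partial translations at every step."""
    beams = [([], 0.0, False)]  # (tokens, log-probability, finished)
    for _ in range(max_len):
        candidates = []
        for seq, logp, done in beams:
            if done:
                candidates.append((seq, logp, True))
                continue
            for token, p in next_token_probs(seq).items():
                if token == EOS:
                    candidates.append((seq, logp + math.log(p), True))
                else:
                    candidates.append((seq + [token], logp + math.log(p), False))
        # Sort by log-probability and keep only the top beam_size hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(done for _, _, done in beams):
            break
    best_seq, best_logp, _ = beams[0]
    return best_seq, best_logp

if __name__ == "__main__":
    print("greedy:", greedy_decode())
    print("beam:  ", beam_search())
```

In this toy setting, greedy decoding commits to "a" because it is the most probable first token, while a beam of size 2 keeps "b" alive long enough to find a sequence with a higher overall probability. Both strategies are deterministic: the same input always produces the same output.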

Beam search, in particular, is very effective at maximizing BLEU scores, “but there is a significant cost to be paid in naturalness and diversity,” the researchers wrote. In practice, this means that MT models typically offer no variability in translations, leading to less engaging output. The researchers also suggested that readers who encounter a given language primarily through these more monotonous translations “might develop a warped exposure to that language.”

Gendered pronouns were just one of several diversity diagnostics the team introduced in their experiments. Even when translating between two gendered languages, the researchers found, search disproportionately chose the more frequent gender, based on the input.

For English-to-German translations, the researchers noted that because the German word “sie” can translate as “she,” “they,” or “you” in English, the result was a bias toward the more common gender pronoun, “sie.” By contrast, when translating from French or German to English, male pronouns were more heavily represented in the training set, and the bias skewed male accordingly.

A possible alternative to search is sampling, which replaces “she” and “her” with male pronouns less often than search does. However, the authors warned, the field might not be ready to shift away from search just yet, since sampling does not yield the same consistently high BLEU scores.
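
One way to picture why search can favor the more frequent pronoun while sampling does not is the sketch below. It assumes a hypothetical model that assigns 60% probability to "he" and 40% to "she" for an ambiguous source pronoun. Those figures, and the names `PRONOUN_PROBS`, `decode_by_search`, and `decode_by_sampling`, are invented for illustration and are not taken from the paper.

```python
import random
from collections import Counter

# Hypothetical next-token distribution for an ambiguous source pronoun.
# These probabilities are invented for illustration only.
PRONOUN_PROBS = {"he": 0.6, "she": 0.4}

def decode_by_search(probs):
    # Search-style decoding: always return the most probable token.
    return max(probs, key=probs.get)

def decode_by_sampling(probs, rng):
    # Sampling: draw a token in proportion to its model probability.
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def pronoun_counts(n=10_000, seed=0):
    """Count how often each pronoun is produced under both strategies."""
    rng = random.Random(seed)
    searched = Counter(decode_by_search(PRONOUN_PROBS) for _ in range(n))
    sampled = Counter(decode_by_sampling(PRONOUN_PROBS, rng) for _ in range(n))
    return searched, sampled

if __name__ == "__main__":
    searched, sampled = pronoun_counts()
    print("search:  ", dict(searched))
    print("sampling:", dict(sampled))
```

Under these assumptions, search never outputs "she" even though the model gives it a 40% chance, whereas sampling produces it in roughly 40% of the draws. The trade-off is that sampling will also sometimes pick low-probability tokens elsewhere in a sentence, which is consistent with the lower BLEU scores the authors describe.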

“The singular focus on improving BLEU leaves no incentive to address issues of diversity,” they wrote. The researchers’ own future work will explore techniques that can achieve high BLEU scores while producing natural-sounding translations.