As Machine Translation Gains Ground, It Taints Training Data — New Research Shows

Researchers have long acknowledged that what comes out of a machine translation (MT) system is only as good as what goes in. Now, experts are asking whether the proliferation of MT is leading to less-than-ideal translations.

“Ten years ago, data contaminated with machine translation was a leading cause of bad translations,” ModelFront CEO and co-founder Adam Bittlingmayer told Slator. “The problem has only increased exponentially over the last decade.”

According to an April 2021 paper, “Documenting the English Colossal Clean Crawled Corpus,” “Fitting models on non-natural language can lead to issues in production.”

The authors, hailing from the University of Washington and the Allen Institute for Artificial Intelligence, wrote, “As the use of models which can generate natural language text proliferates, web-crawled data will increasingly contain data that was not written by humans.”

To estimate the proportion of machine-generated text in the corpus, the authors examined its patent documents, which come from a Google domain. They found that more than 10% of the patents in this corpus came from patent offices that require submissions in a language other than English.

The Google domain typically uses MT to translate those documents into English, while other physical documents have been scanned, run through optical character recognition (OCR), and then machine translated.

Natural language processing (NLP) post-doctoral researcher Bram Vanroy mused in an April 19, 2021 tweet, “Recent studies have shown that multilingual crawled datasets are noisy in terms of the languages that they contain. Has the same been done to check whether such corpora contain original text vs. MT or translationese? Maybe we’re all just training MT systems on…mostly MT.”

Vanroy’s tweet may refer in part to “Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets,” a March 2021 paper resulting from the collaborative work of more than 50 authors at a range of institutions, including Google, Hugging Face, Intel Labs, and several universities.

In a manual audit of 205 language-specific corpora, researchers found issues in lower-resource corpora, and noted that “a significant fraction contains less than 50% sentences of acceptable quality.” 

While the authors did not explicitly link these issues to MT, they did point out that the quality of automatically crawled and filtered datasets tends to be lower than that of hand-curated collections.

“Our quantitative analysis reveals surprisingly low amounts of valid in-language data, and identifies systematic issues across datasets and languages,” the researchers said.

Bittlingmayer tweeted on April 20, 2021, “[Marcin Junczys-Dowmunt] has a theory that back-translation was accidentally implemented for xx→en a decade before it was invented and explains the edge over en→xx. From the error analyses I did in those days, it seems about right.”

Bittlingmayer emphasized to Slator that it is important to distinguish between those two directions, xx→en and en→xx. “For example, when crawling the web for data to train a Spanish-to-English system, ingesting data that was machine-translated from English to Spanish even helps. In fact, that’s one of the key tricks that Google, Microsoft, and DeepL use today: back-translation.”
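
For readers unfamiliar with the technique, the idea behind back-translation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any of the companies named above implement it: `back_translate`, `toy_en_to_es`, and the tiny `TOY_EN_ES` word list are hypothetical names invented here, and the word-for-word lookup merely stands in for a real English-to-Spanish MT system.

```python
from typing import Callable, List, Tuple


def back_translate(
    target_sentences: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) training pairs.

    The target side is human-written text; the source side is produced
    by a reverse-direction MT system.
    """
    return [(reverse_translate(t), t) for t in target_sentences]


# Hypothetical stand-in for a real English-to-Spanish MT system:
# a word-for-word lookup that leaves unknown words unchanged.
TOY_EN_ES = {"the": "el", "cat": "gato", "sleeps": "duerme"}


def toy_en_to_es(sentence: str) -> str:
    return " ".join(TOY_EN_ES.get(w, w) for w in sentence.lower().split())


# Human-written English (the target language of a Spanish-to-English system).
english_monolingual = ["The cat sleeps"]

# Each pair is (machine-made Spanish, human-written English).
synthetic_pairs = back_translate(english_monolingual, toy_en_to_es)
print(synthetic_pairs)
```

In this sketch the machine-made text sits only on the source side, so a Spanish-to-English model trained on such pairs still learns to produce human-written English. The reverse arrangement, with machine output on the target side, is the case that would teach a model to imitate MT.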

Junczys-Dowmunt said on Twitter, “It’s also likely far worse for xx<->yy. There you can assume that nearly everything is just MT.” In later tweets, he noted that language pairs such as Swedish–Korean are likely to depend on a pivot language due to the scarcity of human translators working in those languages.

“Unless it comes from actual multi-lingual sources, then it’s just all translationese,” Junczys-Dowmunt concluded.