Amazon Flags Problem of Using Web-Scraped Machine-Translated Data in LLM Training


On January 11, 2024, Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, and Marcello Federico from Amazon published a paper investigating the prevalence and quality of machine translation (MT) on the web.

They found that “a shocking amount of the web is machine translated” into many languages, and that the quality of these multi-way translations is often low. The finding underscores the importance of considering data quality and sources when training large language models (LLMs).

According to the researchers, this multi-way parallel, machine-generated content is not only prevalent in translations for lower resource languages but also constitutes a “large fraction of the total web content”.

To analyze the characteristics of machine-translated content, the team created a large multi-way parallel corpus called Multi-Way ccMatrix (MWccMatrix), encompassing 6.4 billion unique sentences in 90 languages. 

The corpus consists of translation tuples, each containing two or more sentences in different languages that are translations of each other. The researchers examined patterns of multi-way parallelism, which refers to sets of sentences that are direct translations of one another in three or more languages.
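
To make the idea of a translation tuple concrete, here is a minimal Python sketch. It assumes the input is a list of aligned sentence pairs and merges pairs that share a sentence into one tuple using a union-find structure. The function names (`build_tuples`, `parallelism`) and the toy data are hypothetical, and this is not necessarily how the researchers built MWccMatrix.

```python
from collections import defaultdict


def build_tuples(pairs):
    """Merge aligned (lang, sentence) pairs into translation tuples.

    Assumption: two pairs that share a sentence belong to the same tuple.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def parallelism(translation_tuple):
    """Number of distinct languages in a tuple (2 = ordinary bitext)."""
    return len({lang for lang, _ in translation_tuple})


# Toy data, invented for illustration only.
pairs = [
    (("en", "Hello"), ("de", "Hallo")),
    (("en", "Hello"), ("fr", "Bonjour")),
    (("en", "Good night"), ("ja", "おやすみ")),
]
tuples = build_tuples(pairs)
degrees = sorted(parallelism(t) for t in tuples)
```

In this toy example, the two pairs that share the English sentence "Hello" collapse into a single 3-way parallel tuple, while the remaining pair stays 2-way parallel.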

Low Quality

The analysis revealed that the quality of these multi-way translations is often low. Specifically, multi-way parallel translations, especially those involving a high number of languages, were of significantly lower quality than 2-way parallel translations.

“The more languages a sentence has been translated into, the lower quality the translations are, suggesting a higher prevalence of machine translation,” said the researchers.

This trend was consistent across all eight language pair directions considered: English→German, German→English, French→German, German→French, English→Japanese, Japanese→English, English→Chinese, and Chinese→English.

The researchers found a selection bias towards “shorter and more predictable sentences.” They observed that these sentences predominantly came from low-quality articles and attributed the bias to “low-quality English content being translated en masse into many lower resource languages via MT.”

In addition, the multi-way parallel data had a different topic distribution. They employed professional linguists to classify a random sample of English sentences into different topics and discovered a dramatic shift in the distribution of topics when comparing 2-way parallel data to 8+ way parallel data.

Serious Concerns

The Amazon researchers suggested that these findings have important implications for multilingual model builders and for training LLMs, raising “serious concerns” about the quality of training data for LLMs when sourced from web-scraped content that includes low-quality machine translations.

They emphasized that data quality is “crucial” in LLM training. They also noted that modern AI is enabled by huge amounts of training data, typically several hundred billion to a few trillion tokens, and that training at this scale is only possible with web-scraped data. The prevalence of machine-translated content, especially in lower resource languages, could lead to less fluent models with more hallucinations.

In response to these challenges, the researchers proposed that MT detection could be helpful in filtering monolingual text in lower resource languages, and that multi-way parallelism is a promising way to detect low-quality, machine-translated data, especially in lower resource languages.
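
The filtering idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the researchers' method: it assumes translation tuples are available as sets of (language, sentence) entries, and it treats any sentence appearing in a tuple that spans at least `min_languages` languages as likely machine translated. The function names, the default threshold of 8, and the toy data are all hypothetical.

```python
def flag_likely_mt(tuples, lang, min_languages=8):
    """Collect sentences in `lang` that appear in highly multi-way parallel tuples.

    Assumption: a high degree of parallelism is used as a proxy for
    machine translation; the threshold is a tunable guess.
    """
    flagged = set()
    for translation_tuple in tuples:
        languages = {tuple_lang for tuple_lang, _ in translation_tuple}
        if len(languages) >= min_languages:
            flagged.update(
                sentence
                for tuple_lang, sentence in translation_tuple
                if tuple_lang == lang
            )
    return flagged


def filter_monolingual(sentences, flagged):
    """Drop flagged sentences from a monolingual corpus, keeping order."""
    return [sentence for sentence in sentences if sentence not in flagged]


# Toy data, invented for illustration; a low threshold is used so the
# small example triggers the filter.
tuples = [
    {
        ("en", "Click here"),
        ("de", "Hier klicken"),
        ("fr", "Cliquez ici"),
        ("es", "Haga clic aquí"),
    },
    {("en", "The committee adjourned."), ("de", "Der Ausschuss vertagte sich.")},
]
corpus_de = ["Hier klicken", "Der Ausschuss vertagte sich.", "Guten Morgen"]

flagged = flag_likely_mt(tuples, "de", min_languages=3)
kept = filter_monolingual(corpus_de, flagged)
```

Here the short sentence from the 4-way parallel tuple is removed from the German corpus, while the sentence that appears only in 2-way parallel data and the sentence with no known translations are kept. A production filter would need a threshold tuned per language and would likely combine this signal with other quality checks.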

To facilitate further exploration and analysis, the researchers have released the code for reproducing the corpus and study.