Researchers Propose New Way to Detect Machine Translation — Maybe Google Could Use It

Most machine translation (MT) output may still need a human touch before it can be used in a professional context. But even without that human touch, it is becoming harder to distinguish human writing from machine-generated text.

Yet these gains have one drawback: It is now easier for MT to be abused for malicious purposes, such as plagiarism and fake reviews.

Even Google is having trouble keeping up. In a 2018 Google Webmasters Hangouts session, Senior Webmaster Trends Analyst John Mueller told participants it was possible that some machine-translated content had become fluent enough to fool Google’s own ranking algorithms.

A year later, on October 22, 2019, Mueller responded to related questions on Twitter, stating that text translated by services such as DeepL or Google Translate does not automatically trigger a Google penalty or manual action. (The caveat, though, is that if the translation quality is poor, the content might not rank well.)

Traditional methods of detecting machine translation have their own shortcomings, such as ignoring semantics or working well only with long texts. But an approach explored by Hoang-Quoc Nguyen-Son, Tran Phuong Thao, Seira Hidano, and Shinsaku Kiyomoto, researchers at KDDI Research and the University of Tokyo, may help.

The team randomly selected 2,000 English-French sentence pairs to test the new method. The English sentences served as the human-written text, while the French translations were back-translated into English using Google Translate, serving as the machine-translated counterpart. The resulting English texts were then run through Google Translate several more times, each round going into French and back into English.
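The round-trip step at the heart of the method can be sketched in a few lines of Python. The snippet below is illustrative only: `translate` is a placeholder standing in for whatever MT service is called (Google Translate in the paper), and the function names and loop structure are assumptions, not the authors' code.

```python
# Illustrative sketch of repeated back-translation (round-trip machine translation).
# `translate` is an assumed placeholder for an MT service call, not an API from the paper.

def translate(text: str, src: str, tgt: str) -> str:
    """Placeholder: call an MT service to translate `text` from `src` to `tgt`."""
    raise NotImplementedError("Plug in an MT client here.")


def back_translate(text: str, pivot: str = "fr", rounds: int = 1) -> list[str]:
    """Return the English text obtained after each round trip English -> pivot -> English."""
    versions = []
    current = text
    for _ in range(rounds):
        pivoted = translate(current, src="en", tgt=pivot)   # English -> French
        current = translate(pivoted, src=pivot, tgt="en")   # French -> back into English
        versions.append(current)
    return versions
```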

According to the team’s October 2019 paper, human-written sentences showed more variation in word usage and structure between back-translations than machine-translated sentences. In other words, the more times a text was machine-translated, the more similar the resulting back-translation was to the original text.

The researchers then used BLEU scores to estimate the similarity between a text and its back-translation, and used that similarity to flag content that had been machine-translated or machine back-translated.
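As a rough sketch of that similarity check, the example below scores a text against its own back-translation with NLTK's sentence-level BLEU. The whitespace tokenisation, smoothing choice, and the 0.5 threshold are illustrative assumptions, not values reported in the paper.

```python
# Sketch: score how similar a text is to its own back-translation using sentence-level BLEU.
# A high score suggests the text changed little in the round trip, i.e. it may be machine-translated.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def backtranslation_similarity(original: str, back_translated: str) -> float:
    reference = [original.split()]            # BLEU expects tokenised reference(s)
    hypothesis = back_translated.split()
    smoothing = SmoothingFunction().method1   # avoid zero scores on short sentences
    return sentence_bleu(reference, hypothesis, smoothing_function=smoothing)


def looks_machine_translated(original: str, back_translated: str,
                             threshold: float = 0.5) -> bool:
    # Machine-translated text tends to survive the round trip with fewer changes,
    # so high BLEU between a text and its back-translation is the signal.
    # The threshold here is purely illustrative.
    return backtranslation_similarity(original, back_translated) >= threshold
```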

The team concluded that their method outperformed previous ones in their work with English and French, as well as in later experiments on Japanese.

Future research will evaluate how well the new technique can identify problematic text, such as fake news, which may become even more relevant to Google if the tech giant ever changes its webmaster guidelines to permit machine-generated content.