‘Human Parity Achieved’ in Machine Translation — Unpacking Microsoft’s Claim

A year and a half ago, Google first claimed that its new neural machine translation (NMT) systems could produce some translations that were “nearly indistinguishable” from human output.

But while Google’s “nearly indistinguishable” claim was buried deep on page 18 of the paper’s technical discussion and carefully hedged, Microsoft came out guns blazing, claiming in the very title of a new research paper that they had achieved “human parity” in Chinese to English translation, no less.

According to Microsoft’s March 14, 2018 research paper, titled “Achieving Human Parity on Automatic Chinese to English News Translation,” several variations of a new NMT system they developed achieved “human parity,” i.e. they were judged equal in quality to human translations (the paper defines human quality as “professional human translations on the WMT 2017 Chinese to English news task”).

Within 24 hours, mainstream tech outlets such as TechCrunch, GeekWire, TechRadar, and ZDNet picked up on the story, predictably taking the human parity claim at face value.

Microsoft came up with a new human evaluation system to come to this convenient conclusion, but first they had to make sure “human parity” was less nebulous and better defined.

Microsoft’s definition of human parity in the research is as follows: “If a bilingual human judges the quality of a candidate translation produced by a human to be equivalent to one produced by a machine, then the machine has achieved human parity.”

In mathematical, testable terms, human parity is achieved “if there is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations.”
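In practical terms, a parity check like this comes down to a paired significance test on the per-segment quality scores. The sketch below runs a simple paired t-test with SciPy on made-up score arrays; the paper’s actual statistical procedure and significance threshold may differ.

```python
# Minimal sketch of the parity test: a paired significance test on
# per-segment human quality scores. The scores below are invented, and a
# paired t-test is only one reasonable choice; the paper's exact
# statistical procedure may differ.
from scipy.stats import ttest_rel

machine_scores = [78, 82, 91, 65, 70, 88, 73, 95, 81, 69]  # ratings of MT output
human_scores = [80, 79, 90, 68, 72, 85, 75, 93, 84, 71]    # ratings of human translations

t_stat, p_value = ttest_rel(machine_scores, human_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# If p_value is above the chosen threshold (e.g. 0.05), the difference is not
# statistically significant, which under this definition counts as human parity.
```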

New Human Evaluation Methods

The research team used data from the 2017 Conference on Machine Translation (WMT2017) to train their new NMT system variants and tested them on its Chinese to English news test set (WMT2017 newstest).

The Microsoft team used bilingual human evaluators, presented them with both the source text and translation output from the WMT2017 newstest set, and asked them to score each translation from 0 to 100. The top-performing engine at the WMT2017 conference was Sogou Inc’s Sogou Knowing NMT system, and the researchers also had their evaluators assess its output.

Part of WMT2017’s newstest task, with Chinese source and English target translations side by side; these are the reference human translations used in the conference.

They showed the evaluators output from nine systems. According to the research paper, around 2,000 assessments were made per system (at least 1,827 each).

Ranked from best to worst, according to Microsoft’s human evaluators:

  1. Microsoft’s new NMT engine variation (Combo-6)
  2. Reference human translations used for this research
  3. Microsoft’s new NMT engine variation (Combo-5)
  4. Microsoft’s new NMT engine variation (Combo-4)
  5. WMT2017’s reference translations produced by post-editing machine translation output
  6. Sogou Knowing NMT
  7. WMT2017’s reference human translations used in the conference
  8. Microsoft’s existing production NMT system
  9. Google’s existing production NMT system

According to the Microsoft researchers, the first four are grouped together and are at parity with each other, i.e. their scores are so close that the differences between them are not statistically significant.

Microsoft Versus Sogou

Curiously, Microsoft’s research paper also shows that using this new evaluation method, Sogou Knowing NMT’s score is so close to the score of WMT2017’s reference human translations that they are considered indistinguishable.

Using this new evaluation method, it appears Microsoft also unintentionally showed that Sogou achieved human parity, at least in comparison to the WMT2017 reference human translations.

Meanwhile, both Microsoft’s and Google’s existing production NMT systems scored lowest.

See for yourself: English output of Microsoft’s highest-scoring NMT system variation, taken from their open-source GitHub release. From the content, average sentence length does not appear very long, nor is the verbiage very complex.

They also used Bilingual Evaluation Understudy (BLEU) to measure gains over previous work that scored systems in BLEU points, including WMT2017’s rankings of participating NMT engines.

Most of Microsoft’s NMT model setups (10 out of 12, baseline included) reportedly bested Sogou Knowing NMT’s 26.40 BLEU points. Microsoft’s top-performing NMT variant beat the state of the art by one BLEU point at 27.40, all using the same training data from WMT2017.
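For context on what those numbers measure, corpus-level BLEU is commonly computed with a tool such as sacrebleu. The snippet below shows the basic call on invented example sentences, not on data from the paper.

```python
# Computing a corpus-level BLEU score with the sacrebleu library.
# The hypothesis and reference sentences here are invented examples.
import sacrebleu

hypotheses = [
    "the parliament approved the new budget on tuesday",
    "officials said the talks would continue next week",
]
references = [[  # one reference translation per hypothesis
    "parliament approved the new budget on tuesday",
    "officials said talks would continue next week",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")  # reported on a 0-100 scale
```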

Shiny New Tech and Training Methods

The research team developed new NMT engines for their experiment. They tried recurrent neural networks, convolutional networks, and transformers, and ultimately chose the transformer engines, reportedly due to better output.

Next, they also upgraded their training regimen.

They employed a recent technique called Dual Learning, which allows their model to learn from both the source-to-target and target-to-source directions of the bilingual training data. They also used Deliberation Networks, which add a second decoder that “polishes” the translations of a first decoder in an NMT system, much like an editor polishing a writer’s draft. Additionally, they employed joint training and agreement regularization.
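To make the Dual Learning idea a bit more concrete, here is a toy sketch of the round-trip (source-to-target-to-source) consistency signal, using trivial dictionary lookups in place of real translation models. It only illustrates the shape of the signal; the paper’s actual dual learning objective is probabilistic and trained end to end.

```python
# Toy illustration of the round-trip consistency idea behind dual learning.
# Real dual learning jointly trains two probabilistic NMT models; here the
# "models" are dictionary lookups, purely to show the structure of the signal.
ZH_TO_EN = {"你好": "hello", "世界": "world", "朋友": "friend"}
EN_TO_ZH = {"hello": "你好", "world": "世界"}  # note: no entry for "friend"

def translate(tokens, table):
    # Translate token by token, passing unknown tokens through unchanged.
    return [table.get(tok, tok) for tok in tokens]

def round_trip_penalty(source_tokens):
    # Count tokens that do not survive the zh -> en -> zh round trip.
    # A reconstruction signal of this kind lets both translation directions
    # learn from each other in dual learning.
    forward = translate(source_tokens, ZH_TO_EN)   # zh -> en
    back = translate(forward, EN_TO_ZH)            # en -> zh
    return sum(a != b for a, b in zip(source_tokens, back))

print(round_trip_penalty(["你好", "世界"]))  # 0: perfect round trip
print(round_trip_penalty(["你好", "朋友"]))  # 1: "friend" is never mapped back to "朋友"
```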

They basically mixed and matched all these methods to iteratively improve translation output across several variations of the same NMT system.

The Microsoft team also filtered the training data from WMT2017. After cleaning and filtering, they were left with 18 million bilingual sentence pairs and around 7 million Chinese and English monolingual sentences.
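The paper does not spell out every cleanup rule here, but a typical filtering pass over a parallel corpus looks something like the sketch below. The deduplication, empty-line, length, and length-ratio heuristics are assumptions for illustration, not the paper’s exact criteria.

```python
def clean_parallel_corpus(pairs, max_len=250, max_ratio=2.5):
    """Filter (chinese, english) sentence pairs with common heuristics.

    These rules (dropping empty sides, exact duplicates, overlong sentences,
    and badly mismatched lengths) are typical corpus-cleaning steps assumed
    for illustration; the paper's exact filtering criteria may differ.
    """
    seen = set()
    kept = []
    for zh, en in pairs:
        zh, en = zh.strip(), en.strip()
        if not zh or not en:                      # drop pairs with an empty side
            continue
        if (zh, en) in seen:                      # drop exact duplicates
            continue
        zh_len, en_len = len(zh), len(en.split())
        if zh_len > max_len or en_len > max_len:  # drop very long sentences
            continue
        ratio = en_len / max(zh_len, 1)
        if ratio > max_ratio or ratio < 1.0 / max_ratio:
            continue                              # drop badly mismatched lengths
        seen.add((zh, en))
        kept.append((zh, en))
    return kept

# Example: only the well-matched pair survives the filter.
sample = [
    ("你好世界", "hello world"),
    ("", "empty source side"),
    ("你好", "a very " + "long " * 300 + "sentence"),
]
print(clean_parallel_corpus(sample))
```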

Future Work

Microsoft made everything about this new research open source, citing external validation and future research as the reasons.

As for when, if ever, Microsoft plans to transition their new systems into production, a company spokesperson told ZDNet: “We’re working to bring this to production as soon as possible, but we have nothing to announce at this time.”
