As machine translation (MT) quality for major language pairs has improved, MT has moved from the ivory tower to become an industry standard. For some language service providers (LSPs), adopting MT for specific project types and subject matter has been a no-brainer.
Past research has shown that having human post-editors polish MT output is more efficient than having a human translate text from scratch, saving both time and cost. But new research from Memsource and Charles University in Prague has analyzed those claims more closely — and found a much more complicated relationship between MT quality and human productivity.
The September 2021 paper, “Neural Machine Translation Quality and Post-Editing Performance,” focused on state-of-the-art MT models. Thirty professional English-to-Czech translators and translation reviewers worked in two stages, post-editing and review, on output from 13 MT engines. The experiment also included a “variant with no translation (Source) and with a pre-existing reference translation (Reference).”
The MT engines included two commercially available models (Google and Microsoft) as well as LINDAT, a document-level system developed by Charles University and made available for non-commercial purposes.
Participants used Memsource as a translation productivity (a.k.a. CAT) tool for both post-editing and review. The CAT tool measured time spent editing and thinking (i.e., the time between edits) for each stage. The researchers estimated the linguists’ post-editing times and the quality of the final texts.
The authors, Memsource AI Research and Development Manager Aleš Tamchyna and Charles University researchers Vilém Zouhar, Martin Popel, and Ondřej Bojar, specifically set out to establish recommendations for using MT in localization workflows.
By testing a range of MT models, the study was designed to reflect “realistic scenarios in localization workflows where users can typically decide among several engines of comparable but not identical performance.”
Similarly, the text came from domains common to LSP use cases for MT: news texts, lease agreements, audit documents, and technical documentation. Post-editors and reviewers were instructed to correct mistranslations, inaccuracies, and grammar or style errors while avoiding preferential changes.
“As expected, post-editing human reference led to the smallest amount of edits and time spent,” they wrote. “Contrary to current results, translating from scratch was not significantly slower than post-editing in either of the two phases.”
Complexity and a Quite Striking Discovery
Tamchyna told Slator that his main takeaway from the study is the complexity of the relationship between post-editing productivity and MT quality.
“We found that the effect is rather weak and it is difficult to discern the exact impact of MT quality on the post-editing speed,” he said. In particular, he added, noise in the measured editing times may have obscured the impact on post-editing productivity. Larger-scale settings in future studies might help average out the noise.
Already looking ahead, Tamchyna noted at least one “quite striking” finding worth further investigation. Researchers expected the reviewers to make significantly more (or more significant) changes to raw MT output than they made to translations by human post-editors, but this was not the case.
“I suspect it is simply because they expected the translations to already be correct,” Tamchyna said. But it gives rise to the question: “Does that mean it might be possible to skip the post-editing phase entirely and simply treat MT outputs as if they were produced by a human translator?”
Much like the MT quality–human productivity relationship itself, a linguist’s feelings toward MT are complicated, according to a survey of the study participants.
The survey results indicated that professional translators and reviewers have a “clear preference for using even imprecise TM [translation memory] matches (85–94%) over MT output. This corresponds to the general tendency towards the opinion that post-editing is more laborious than using a TM.”
On the other hand, the authors wrote, there is “some level of trust in MT in the process helping to improve translation consistency and produce overall better results.”
They concluded, “The current recommendation for the industry is that they should not expect small improvements in MT (measured by automatic metrics) to lead to significantly lower post-editing times nor significantly higher post-edited quality.”