How does machine translation output post-edited by human linguists (PEMT) differ from translation from scratch (HT), that is, translation produced without help or interference from a computer-generated first draft?
Three years into the age of neural machine translation, and as a significant portion of the language industry has transitioned to a post-editing-only world, this is a highly relevant question to explore.
In a paper entitled “Post-editese: an Exacerbated Translationese” published on July 1, 2019, University of Groningen Assistant Professor Antonio Toral said his “current research can be framed as a quest to find out whether there is evidence of post-editese.”
“I got interested back in 2014 in machine-assisted translation of literary texts (novels),” Toral told Slator. “I’ve previously conducted a post-editing experiment with professional literary translators and the results were positive in terms of productivity. However, I note that in that particular text type the reading experience is really important. Hence, I got interested in analyzing human vs post-edited translations. This paper is my first attempt at that.”
In the paper, Toral conducted a set of computational analyses where he compared “PE against HT on three different datasets that cover five translation directions with measures that address different translation universals and laws of translation: simplification, normalisation and interference.”
Toral found that PEMT has lower lexical variety and lower lexical density. Furthermore, he found that the sentence length of PEMT corresponds more closely to that of the source text. In terms of part-of-speech (PoS) sequences, too, PEMT resembles the original more closely than HT.
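For readers who want a feel for what such surface-level measures look like, here is a minimal sketch in Python. It is not Toral's code. It uses common textbook definitions: type-token ratio for lexical variety, the share of content words for lexical density, and a normalised difference in token counts for sentence length. The function names, the tiny function-word list (a stand-in for a proper PoS tagger), and the toy sentences are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny function-word list (an assumption for this
# sketch). A real analysis would identify content words with a PoS tagger.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "to", "in", "on", "and", "or",
    "is", "was", "it", "that",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def type_token_ratio(tokens):
    """Lexical variety: distinct word forms divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def lexical_density(tokens):
    """Share of tokens that are not in the function-word list."""
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)


def length_difference(source, target):
    """Absolute difference in token counts, normalised by source length.

    Lower values mean the translation tracks the source length more closely.
    """
    src_len = len(tokenize(source))
    tgt_len = len(tokenize(target))
    if src_len == 0:
        raise ValueError("source sentence has no tokens")
    return abs(src_len - tgt_len) / src_len


if __name__ == "__main__":
    # Toy sentences invented for this sketch, not taken from the paper's data.
    human = "The weary traveller finally reached the quiet harbour at dusk"
    post_edited = "The tired traveller finally arrived at the port in the evening"
    for label, sentence in (("HT", human), ("PEMT", post_edited)):
        tokens = tokenize(sentence)
        print(label, type_token_ratio(tokens), lexical_density(tokens))
```

On real data, these scores would be computed over whole documents and compared between the HT and PEMT versions of the same source text. A single sentence pair says nothing on its own.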
It is no coincidence that Toral’s interest in the question arose in the context of literary translation, which lies at the outer edge of the spectrum of text types requiring a translator to take liberties. While neural machine translation has come a long way in producing fluent output, computers still lack the ability to rewrite, add, combine, or remove text in the fundamentally creative way that humans can.
In a nutshell, the demonstrated productivity gains achieved by PEMT come at the cost of producing a translation that, according to the paper, “is simpler and more normalised and has a higher degree of interference from the source language than HT.”
Asked about potential limitations to keep in mind when reviewing his research, Toral commented: “One issue is that the metrics I have used are rather simple and work at surface level. I cannot say yet whether an analysis using more linguistically-oriented features (e.g., using syntactic information) would lead to the same results. Another issue is that the datasets are rather small; I hope with these results I’ll be able to convince someone in industry to use bigger datasets.”