On October 25, 2022, researchers from the University of Massachusetts Amherst and Google Research published the results of an experiment designed to test the use of machine translation (MT) on literary texts.
Seeking to improve MT output, the researchers created a large-scale dataset called PAR3, short for Parallel Paragraph-Level Paraphrases. The 121,000-paragraph dataset was built from 118 public-domain non-English novels. The data consisted of published electronic versions of those literary works in 19 languages, along with multiple human translations of each into English.
To prepare the data, the researchers converted the book contents into plain text and stripped out text artifacts and notes. They then machine-translated the source texts with Google Translate, aligned each source paragraph to at least two human English translations of the same paragraph and to one machine-translated paragraph, and finally filtered the data.
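The alignment-and-filtering step described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code; the function name, field names, and the exact filtering criterion are assumptions.

```python
def build_par3_rows(source_paras, human_translations, mt_paras):
    """Sketch of PAR3-style row construction (illustrative, not the paper's code).

    source_paras: list of source-language paragraphs.
    human_translations: list of lists, one English version per translator,
        already paragraph-aligned to source_paras.
    mt_paras: machine-translated paragraphs, same alignment.
    """
    rows = []
    for i, src in enumerate(source_paras):
        # Collect the non-empty human references for this paragraph.
        refs = [t[i] for t in human_translations if t[i].strip()]
        # Keep a paragraph only if it has at least two human translations
        # and a non-empty MT output (assumed filtering criterion).
        if len(refs) >= 2 and mt_paras[i].strip():
            rows.append({"source": src, "references": refs, "mt": mt_paras[i]})
    return rows
```

Each resulting row pairs one source paragraph with multiple human references and one MT output, which is the structure the A/B evaluation later relies on.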
The resulting PAR3 dataset differs from the samples used in previous literary MT studies in three ways: it is 20 times larger; it is segmented by paragraph rather than by sentence; and it includes the aligned source text. The experiment thus focused on the output quality of paragraph-level literary translation as evaluated by human experts.
MT Prefers MT, Literally
To measure output quality, the researchers used the BLEU, BLEURT, and BLONDE automatic MT evaluation metrics as well as human reviewers, who included professional translators and monolingual English raters.
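To give a sense of what an n-gram metric like BLEU actually measures, here is a minimal, smoothed single-reference sketch in pure Python. This is a toy simplification for illustration, not the implementation used in the study (production systems use standardized tooling with proper tokenization and multi-reference support).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(hypothesis, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (smoothed to avoid log(0)) times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped overlap: each hypothesis n-gram counts at most as
        # often as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * geo_mean
```

A perfect match scores 1.0 and a fully disjoint output scores near 0, which is why such surface-overlap metrics can reward literal MT output over a freer human translation of the same paragraph.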
Reviewers had to perform A/B tests on PAR3 to indicate their preference between a Google Translate output paragraph and a reference human translation. Reviewers also provided open comments for each example to explain their choice.
The automatic evaluation metrics showed a preference for Google Translate outputs over the human translations in the dataset. Human reviewers, by contrast, overwhelmingly preferred the human translations, choosing them over MT 85% of the time.
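The A/B protocol above reduces to a simple preference tally. The sketch below shows that computation on invented toy data chosen to reproduce the 85% headline figure; it is not the paper's actual judgment data.

```python
def preference_rate(judgments, side):
    """Fraction of A/B judgments in which the reviewer picked `side`."""
    return sum(j["preferred"] == side for j in judgments) / len(judgments)

# Toy judgments (illustrative only): 17 of 20 reviewers pick the
# human translation, matching the reported 85% preference rate.
judgments = [{"preferred": "human"}] * 17 + [{"preferred": "mt"}] * 3

human_rate = preference_rate(judgments, "human")  # 0.85 on this toy split
```

In practice each judgment would also carry the item ID and the reviewer's free-text comment, so rates can be broken down by error type.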
Quality issues identified by human experts in the MT outputs went beyond accuracy errors and stylistic inconsistencies to include readability, fluency, and “overly literal translations and discourse-level errors (e.g., coreference, pronoun consistency).”
To correct the MT issues, the researchers fine-tuned a model on an automatic post-editing task. Human reviewers preferred the post-edited translations 69% of the time and noted a lower incidence of errors.
Retrain and Repeat
Researchers acknowledged in the paper that “the task of conveying an author’s ideas highlights yet another difference between literary and traditional MT: document-level context is especially critical for the literary domain due to the presence of complex discourse structure, rendering the typical sentence-level MT pipeline insufficient for this task.”
The researchers included an extensive list of prior publications on literary MT to support the premise that “state-of-the-art MT systems and MT evaluation metrics fail in the literary domain.”
Using paragraph-level segmentation instead of typical sentence-level segmentation did not seem to make a significant difference in the MT outputs. However, by releasing the PAR3 dataset to the general public, the researchers aim to encourage further exploration into the use of MT in literary translation using pretrained language models.