Research Shows How to Replicate ‘Translator Style’ in Literary Machine Translation

Machine Translation of Literary Content

Machine translation (MT) systems are mostly designed to serve the business-to-business and business-to-consumer domains. As MT has grown more sophisticated, researchers are now taking on what is often seen as its final frontier: literary translation.

Zeynep Yirmibeşoğlu, Olgun Dursun, Harun Dallı, Mehmet Şahin, Ena Hodzik, Sabri Gürses, and Tunga Güngör from Boğaziçi University published a research paper on July 21, 2023, demonstrating that MT systems can be customized to replicate the unique characteristics of a translator’s style in literary translations.

Employing a hybrid methodology and data augmentation techniques, the researchers obtained promising results in producing high-quality literary machine translations with style transfer. This study marks the first attempt in Turkish literary MT to train models specific to a translator’s works, as highlighted by the authors.

The concept of “translator style” is a subject of increasing interest in corpus-based translation studies. Some scholars argue that stylistic features can be observed solely by analyzing the target text (i.e., the translated work), while others take into account the original author’s style when analyzing the target text.

In this study, the term “translator style” is defined as “a consistent configuration of distinct characteristics that can be identified across multiple translations.” The focus was on investigating the stylistic features introduced by the translator in the translated works, separate from the influence of the source text or the original author’s style.

For the MT model, the researchers used a large pre-trained Transformer model fine-tuned on the translator corpus. The fine-tuning process involved training the model for five epochs while tuning hyperparameters such as batch size, learning rate, and optimizer settings to achieve optimal performance. Focusing on the English-Turkish translation direction, the researchers used pre-trained models from the OPUS-MT project.
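As a rough illustration of what such a fine-tuning setup might look like, here is a minimal sketch using the Hugging Face Transformers library. Only the five epochs and the use of OPUS-MT models come from the article. The checkpoint name, batch size, learning rate, the AdamW optimizer, and the helper names `FineTuneConfig`, `batches`, and `fine_tune` are assumptions made for the example, not the researchers' actual configuration.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

Pair = Tuple[str, str]  # (English source sentence, Turkish target sentence)


@dataclass
class FineTuneConfig:
    # Assumed OPUS-MT English-Turkish checkpoint; the article does not name one.
    model_name: str = "Helsinki-NLP/opus-mt-tc-big-en-tr"
    epochs: int = 5               # stated in the article
    batch_size: int = 16          # assumed value
    learning_rate: float = 2e-5   # assumed value


def batches(pairs: List[Pair], size: int) -> Iterator[List[Pair]]:
    """Yield consecutive slices of at most `size` sentence pairs."""
    for start in range(0, len(pairs), size):
        yield pairs[start:start + size]


def fine_tune(pairs: List[Pair], cfg: FineTuneConfig):
    """Fine-tune a pre-trained translation model on aligned sentence pairs."""
    # Imported here so the helpers above work without torch/transformers installed.
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(cfg.model_name)
    model = MarianMTModel.from_pretrained(cfg.model_name)
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.learning_rate)

    model.train()
    for _ in range(cfg.epochs):
        for batch in batches(pairs, cfg.batch_size):
            sources = [src for src, _ in batch]
            targets = [tgt for _, tgt in batch]
            enc = tokenizer(sources, text_target=targets, return_tensors="pt",
                            padding=True, truncation=True)
            # Exclude padding positions from the loss.
            enc["labels"][enc["labels"] == tokenizer.pad_token_id] = -100
            loss = model(**enc).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model, tokenizer
```

Calling `fine_tune(aligned_pairs, FineTuneConfig())` would download the checkpoint and train on CPU. A real run would add a GPU device, a validation set, and a hyperparameter search of the kind the article describes.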

For the corpus compilation, they used two datasets: the translator corpus, consisting of the works of the literary translator Nihal Yeğinobalı, and the reference corpus, which contains Turkish monolingual texts.

For the style analysis, the researchers adopted a hybrid methodology, combining close-reading techniques to qualitatively assess lexical, grammatical, semantic, and discourse features, and distant-reading techniques for a quantitative analysis of lexical and morphological traits. The aim was to identify the unique stylistic elements that characterize the translator’s works and evaluate the possibility of replicating their style in MT models.
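The distant-reading side of such an analysis can be pictured as computing simple corpus statistics. The sketch below is an illustration only: the function name `lexical_profile` and the chosen measures (type-token ratio, mean word length, most frequent words) are assumptions, and the article does not say which lexical or morphological features the researchers actually measured.

```python
import re
from collections import Counter
from typing import Dict


def lexical_profile(text: str, top_n: int = 3) -> Dict[str, object]:
    """Compute a few basic lexical statistics for a text."""
    # Note: str.lower() is not Turkish-aware (dotted/dotless I), so a real
    # analysis of Turkish text would need locale-specific case folding.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {
        "tokens": total,
        "types": len(counts),
        "type_token_ratio": len(counts) / total if total else 0.0,
        "mean_word_length": sum(len(t) for t in tokens) / total if total else 0.0,
        "top_words": counts.most_common(top_n),
    }


profile = lexical_profile("the cat saw the dog and the bird")
print(profile["type_token_ratio"], profile["top_words"][0])
```

Running the same function over a translator's books and over a reference corpus would give two profiles that can be compared feature by feature.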

Data Augmentation

Literary MT requires a large amount of literary parallel data, which is not available for every language and is very expensive to align, whereas monolingual data is abundant in nearly all languages. To address the scarcity of aligned literary parallel data, the researchers employed two data augmentation methods: back-translation and self-training.

Back-translation involved fine-tuning the Turkish-English model on the manually aligned books and using it to translate Turkish sentences into English. These synthetic English sentences were then paired with their Turkish originals and combined with the manually aligned books to create augmented data. Self-training, on the other hand, involved using monolingual English sentences to generate synthetic Turkish sentences, in line with the original English-Turkish translation direction.
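The difference between the two augmentation methods comes down to which side of each sentence pair is synthetic. The following sketch shows this under simplifying assumptions: the translators are passed in as plain functions, and the names `back_translate`, `self_train`, and `augment` are hypothetical rather than taken from the study.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (English source, Turkish target)
Translator = Callable[[str], str]


def back_translate(tr_sentences: List[str], tr_to_en: Translator) -> List[Pair]:
    """Synthetic English source, genuine Turkish target."""
    return [(tr_to_en(tr), tr) for tr in tr_sentences]


def self_train(en_sentences: List[str], en_to_tr: Translator) -> List[Pair]:
    """Genuine English source, synthetic Turkish target."""
    return [(en, en_to_tr(en)) for en in en_sentences]


def augment(aligned: List[Pair], tr_mono: List[str], en_mono: List[str],
            tr_to_en: Translator, en_to_tr: Translator) -> List[Pair]:
    """Combine manually aligned pairs with both kinds of synthetic pairs."""
    return (aligned
            + back_translate(tr_mono, tr_to_en)
            + self_train(en_mono, en_to_tr))


# Toy stand-ins for the fine-tuned MT models (dictionary lookups).
toy_tr_to_en = {"Merhaba.": "Hello."}.get
toy_en_to_tr = {"Thank you.": "Teşekkürler."}.get

data = augment([("Good night.", "İyi geceles.".replace("les", "ler"))],
               ["Merhaba."], ["Thank you."], toy_tr_to_en, toy_en_to_tr)
print(data)
```

In back-translation the target side stays human-written, which is why it is often used when the goal is to preserve target-language style.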

The study revealed that fine-tuning on a literary training set significantly improves translation quality on literary test sets. More specifically, adapting a pre-trained model to a translator’s works increases the BLEU score by approximately 45-56% on the literary data and improves the transfer of the translator’s style by 18-40% in terms of cosine similarity, compared to the pre-trained model.
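To make the two kinds of figures concrete, the sketch below shows how a relative gain and a cosine similarity between feature vectors can be computed. It assumes that style is represented as a vector of feature frequencies, which the article does not spell out, and every number in it is a made-up placeholder rather than a result from the study.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def relative_gain(baseline: float, tuned: float) -> float:
    """Percentage change of `tuned` relative to `baseline`."""
    return (tuned - baseline) / baseline * 100


# Hypothetical style-feature frequencies (placeholders, not study data).
translator_profile = [0.12, 0.30, 0.05]
baseline_output = [0.02, 0.10, 0.20]
tuned_output = [0.10, 0.28, 0.07]

sim_before = cosine_similarity(translator_profile, baseline_output)
sim_after = cosine_similarity(translator_profile, tuned_output)
print(sim_before, sim_after, relative_gain(sim_before, sim_after))
```

The same `relative_gain` function applies to BLEU: a hypothetical rise from 20.0 to 30.0 would be reported as a 50% increase.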

Note: This research is funded by the Scientific and Technological Research Council of Türkiye (TÜBİTAK) under Grant No: 121K221 (Literary Machine Translation to Produce Translations that Reflect Translators’ Style and Generate Retranslations).
