Revisiting Controlled Language for Better Machine Translation Quality

Controlled Language Machine Translation

The terms “controlled authoring” and “controlled language” have been part of the Natural Language Processing (NLP) lexicon since the first computer-assisted translation (CAT) and machine translation (MT) tools became available in the 1980s. For MT specifically, researchers at universities like Carnegie Mellon were discussing controlled language attributes as early as 1989, in the KANT project on multilingual document production.

Key attributes studied then, and still relevant today, include controlled source-language vocabulary and grammar; domain-specific semantic models; disambiguation; and stylistic quality.

In a paper published in December 2022, researchers Yifan Wang, Zewei Sun, Shanbo Cheng, Weiguo Zheng, and Mingxuan Wang from Fudan University and ByteDance AI Lab focus on stylistic quality as a way to improve MT quality.

The paper discusses previous and novel approaches to controlled-language MT output. The methodology the authors propose centers on creating a style benchmark, automating evaluation, and using a prompt training and retrieval method.

The researchers note that previous related studies are limited in scope, with style benchmark datasets centering mainly on “formality and politeness” and addressing mostly European languages. Another issue is the need for iterative training and constant fine-tuning to adapt language models every time a new style is introduced.

Multisampling Style to Avoid Iterative Fine-Tuning

To create a benchmark for stylized MT, the researchers started from a broad definition, “translations with certain language characteristics or styles,” while keeping translation quality as the ultimate goal. They call this benchmark “multiway stylized machine translation (MSMT).”

The datasets consist of a single source language (English or Chinese) and multiple target references (Chinese, English, Korean, and Portuguese). Each dataset covers four directions and diverse language styles using labeled sentences. “Each source sentence has two references in different styles, which is convenient for automatic evaluation,” the researchers explain in the paper’s introduction.
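To illustrate why paired, style-labeled references simplify automatic evaluation, here is a minimal sketch. The data, the `score_against_style` helper, and the toy token-overlap score are all hypothetical stand-ins (the paper uses its own metrics); the point is simply that the scorer can select the reference matching the requested style.

```python
def token_overlap(hypothesis: str, reference: str) -> float:
    """Toy similarity score: fraction of reference tokens present in the hypothesis."""
    hyp_tokens = set(hypothesis.lower().split())
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    return sum(t in hyp_tokens for t in ref_tokens) / len(ref_tokens)

def score_against_style(hypothesis: str, references: dict, style: str) -> float:
    """Pick the reference labeled with the requested style and score against it."""
    return token_overlap(hypothesis, references[style])

# Hypothetical MSMT-style entry: one source, two style-labeled references.
entry = {
    "source": "你要去哪里？",
    "references": {
        "modern": "where are you going",
        "early_modern": "whither goest thou",
    },
}

print(score_against_style("where are you going", entry["references"], "modern"))  # → 1.0
```

Because each source sentence carries one reference per style, no extra alignment step is needed: requesting a different style just switches the reference being scored against.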

To avoid repeated fine-tuning, the researchers propose a methodology they call “style activation prompt (StyleAP).” The premise is that “once the model has been trained on all kinds of data with various styles, it has the potential to generate any style,” with no semantic difference between the styled outputs.
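The retrieval-and-prompt idea can be sketched in a few lines. Everything here is a toy illustration of the general mechanism, not the paper's implementation: the corpus, the `</s>` separator, and retrieval by a style label are all assumptions. The model is never re-tuned; a sentence already in the target style is retrieved and prepended to nudge generation toward that style.

```python
# Hypothetical style-labeled monolingual examples (Korean honorific vs. not).
STYLIZED_CORPUS = [
    ("honorific", "어디에 가십니까?"),
    ("non_honorific", "어디 가?"),
]

def build_style_prompt(source: str, target_style: str) -> str:
    """Prepend a retrieved target-style sentence as a style activation prompt."""
    for style, sentence in STYLIZED_CORPUS:
        if style == target_style:
            # The separator token is an assumption for illustration only.
            return f"{sentence} </s> {source}"
    raise ValueError(f"no example for style {target_style!r}")

prompt = build_style_prompt("Where are you going?", "honorific")
print(prompt)
```

Swapping the prompt swaps the style, which is why no per-style tags or additional tuning rounds are needed under this scheme.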

The researchers conducted their MT experiments on several language pairs: English to Classical and Modern Chinese; Chinese to Early Modern English (with The Complete Works of William Shakespeare included in this corpus) and to Modern English; English to Honorific and Non-honorific Korean; and English to Brazilian and European Portuguese (with the locale variation serving as the stylistic differentiator).

How Does StyleAP Compare to Other Models?

The researchers claim that StyleAP’s translation quality is comparable to that of a baseline transformer model (a standard neural architecture that tracks relationships in sequential data).

Compared to a style transfer model (which trains the translation model on regular parallel data and then trains a separate transfer model on stylized data), StyleAP “significantly enhances the transfer ratio,” with the caveat that very short sentences have no discernible style.

Against a tag-tuning model (which marks the source text with a known style tag, so that different tags yield different styles), StyleAP comes out ahead in that it requires neither extra tags nor extra tuning.

The researchers concluded that “The quality scores are comparable with the baseline model, while the transferred ratios are much higher [in StyleAP] than the baseline model,” and that “our method can effectively translate the source text into a sentence with specific attributes without quality loss.”

The paper’s Appendix describes the language-expert evaluation phase of the experiments. A human evaluator rated MT output accuracy and style on two different scales for all language combinations.

Accuracy was measured on a scale from 0 to 4: 0 represented “translated text [that] is almost completely wrong or completely incomprehensible”; 1, a translation with serious errors; 2, a few errors that impact understanding; 3, a few errors that do not impact understanding; and 4, a translation with “no errors and no modification required.”

The human evaluation of style was binary: 0 meant the style had not transferred to the translation, and 1 meant a full style match.
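The two scales above combine naturally into the summary numbers the paper reports: a mean accuracy score and a transfer ratio (the share of outputs judged to be in the requested style). A minimal sketch, with hypothetical per-sentence scores:

```python
def aggregate(scores: list[tuple[int, int]]) -> tuple[float, float]:
    """scores: (accuracy 0-4, style 0/1) per sentence -> (mean accuracy, transfer ratio)."""
    accuracy = sum(a for a, _ in scores) / len(scores)
    transfer_ratio = sum(s for _, s in scores) / len(scores)
    return accuracy, transfer_ratio

# Hypothetical judgments for four sentences.
mean_acc, ratio = aggregate([(4, 1), (3, 1), (4, 0), (2, 1)])
print(mean_acc, ratio)  # → 3.25 0.75
```

Reading the two numbers together matches the paper's framing: quality (accuracy) should stay level with the baseline while the transfer ratio rises.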