Separating Accuracy and Fluency Improves Machine Translation Evaluation, Study Finds

In a February 20, 2024 paper, Zheng Wei Lim, Ekaterina Vylomova, Trevor Cohn, and Charles Kemp from the University of Melbourne underscored the importance of accuracy and fluency in translation. They argued that treating these two dimensions separately could improve current machine translation (MT) evaluation metrics and help optimize MT training and performance.

Accuracy in translation refers to the faithfulness of the translation to the source text, ensuring that all information from the source text is preserved in the target text. Fluency describes how well the translated text conforms to the norms and naturalness of the target language, making it easy for the reader to understand and process.

The authors mentioned that the relationship between these two dimensions in translation has long been debated. While some suggest that accuracy and fluency trade off against each other — meaning that improvement in one aspect may come at the expense of the other — others argue that they are highly correlated (i.e., go hand in hand) and difficult to distinguish.

There is often a trade-off between the two: a very accurate translation may read less fluently, while a highly fluent translation may sacrifice some accuracy. Translators often need to navigate this trade-off when choosing between different translation options, as maximizing accuracy and fluency simultaneously can be challenging, according to the authors.

In their paper, the authors shed light on this relationship by applying the concept of Simpson’s paradox — a phenomenon where the overall trend in data can be different from the trends in smaller groups within that data — to translation, revealing insights into how accuracy and fluency interact at different levels of analysis.

Specifically, they demonstrated that while accuracy and fluency may appear positively correlated at the corpus level, they exhibit a trade-off when individual translation segments are examined, and they argued that the relationship between the two is best evaluated at the segment level. “Of the two levels of analysis, the segment level is the appropriate level for understanding how humans and machine translation systems should choose among possible translations of a source segment,” they said.
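
To see how such a reversal can arise, consider a small numerical sketch (the scores below are invented for illustration and are not data from the paper): within each source segment, candidate translations trade accuracy against fluency, but segments differ in overall difficulty, so pooling everything at the corpus level produces a positive correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 200 source segments, 5 candidate translations each.
# Segment difficulty shifts both dimensions together, while within a segment
# each candidate sits at a different point on an accuracy-fluency trade-off.
n_segments, n_candidates = 200, 5

pooled_acc, pooled_flu = [], []
within_corrs = []

for _ in range(n_segments):
    difficulty = rng.normal(0, 1)                 # shared segment-level component
    tradeoff = rng.uniform(0, 1, n_candidates)    # position on the trade-off curve
    accuracy = difficulty + tradeoff + rng.normal(0, 0.05, n_candidates)
    fluency = difficulty + (1 - tradeoff) + rng.normal(0, 0.05, n_candidates)

    pooled_acc.extend(accuracy)
    pooled_flu.extend(fluency)
    within_corrs.append(np.corrcoef(accuracy, fluency)[0, 1])

print("corpus-level correlation: %.2f" % np.corrcoef(pooled_acc, pooled_flu)[0, 1])
print("mean segment-level correlation: %.2f" % np.mean(within_corrs))
```

Pooled together, easy segments score high on both dimensions and hard segments score low on both, which masks the per-segment trade-off and yields the paradoxical corpus-level picture.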

Middle Ground

According to the authors, this trade-off has significant implications for assessing translation quality and developing MT systems: understanding and managing it is crucial for evaluating translations, training models effectively, and optimizing MT system performance.

The authors suggested that current translation quality evaluation methods may need to be adjusted. They noted that in recent WMT General MT Tasks, human evaluation is performed using Direct Assessment and Scalar Quality Metrics (DA+SQM), which combines meaning preservation (accuracy) and grammar (fluency) into a single score and may not fully capture the nuances of accuracy and fluency. Multidimensional Quality Metrics (MQM), on the other hand, provides more detailed scores but is more resource-intensive.

The authors proposed a “middle ground” that extends the DA+SQM approach to consider accuracy and fluency as separate aspects, similar to the methodology used in WMT16. By doing so, automatic MT evaluation metrics like BLEURT and COMET, which are fine-tuned to DA scores, could provide independent scores for accuracy and fluency, offering a more detailed evaluation of translations.
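
The paper does not prescribe a particular architecture for such metrics. Purely as an illustrative sketch, a learned metric fine-tuned on separate accuracy and fluency annotations could pair a shared encoder with two regression heads; the module names, dimensions, and training setup below are assumptions, not the actual BLEURT or COMET design.

```python
import torch
import torch.nn as nn

class DualHeadMetric(nn.Module):
    """Toy sketch: a shared sentence-pair encoder with two regression heads,
    so accuracy and fluency are predicted as separate scores rather than a
    single combined DA-style score. Hypothetical; not BLEURT/COMET code."""

    def __init__(self, embed_dim: int = 768, hidden: int = 256):
        super().__init__()
        # Stand-in for a pretrained encoder embedding a (source, translation) pair.
        self.encoder = nn.Sequential(nn.Linear(embed_dim, hidden), nn.Tanh())
        self.accuracy_head = nn.Linear(hidden, 1)   # meaning preservation
        self.fluency_head = nn.Linear(hidden, 1)    # target-language naturalness

    def forward(self, pair_embedding: torch.Tensor):
        h = self.encoder(pair_embedding)
        return self.accuracy_head(h).squeeze(-1), self.fluency_head(h).squeeze(-1)

model = DualHeadMetric()
fake_batch = torch.randn(4, 768)            # pretend sentence-pair embeddings
acc_pred, flu_pred = model(fake_batch)

# Training would regress each head on its own human annotation,
# e.g. loss = mse(acc_pred, acc_labels) + mse(flu_pred, flu_labels)
print(acc_pred.shape, flu_pred.shape)
```

The point of the sketch is simply that, once separate annotations exist, nothing forces the two dimensions back into a single number at prediction time.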

Furthermore, they stressed the importance of developing MT models that can effectively balance the accuracy-fluency trade-off in a way that mimics human decision-making. They highlighted that in certain contexts, such as translating legal texts, accuracy is crucial, while in informal conversations, fluency may be more important. By striking the right balance between accuracy and fluency, these models can enhance translation quality across various contexts.
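
As a minimal illustration of such context-dependent balancing (the candidate scores and weights below are invented), selecting among candidate translations could weight the two dimensions differently depending on the use case.

```python
# Hypothetical candidate translations with separate accuracy/fluency scores.
candidates = [
    {"text": "literal rendering",  "accuracy": 0.95, "fluency": 0.70},
    {"text": "balanced rendering", "accuracy": 0.88, "fluency": 0.86},
    {"text": "free rendering",     "accuracy": 0.78, "fluency": 0.96},
]

def pick(candidates, accuracy_weight):
    """Choose the candidate maximizing a weighted mix of the two dimensions."""
    fluency_weight = 1.0 - accuracy_weight
    return max(
        candidates,
        key=lambda c: accuracy_weight * c["accuracy"] + fluency_weight * c["fluency"],
    )

print(pick(candidates, accuracy_weight=0.8)["text"])  # legal text: favour accuracy
print(pick(candidates, accuracy_weight=0.3)["text"])  # informal chat: favour fluency
```

With a high accuracy weight the literal candidate wins; lowering the weight shifts the choice toward the freer, more fluent candidate.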