Meta Tries Making Human Evaluation of Machine Translation More Consistent

Although automatic evaluation metrics, such as BLEU, have been widely used in industry and academia to evaluate machine translation (MT) systems, human evaluators are still considered the gold standard in quality assessment.

Human evaluators use quite different criteria when evaluating MT output. These are determined by their linguistic skills and translation-quality expectations, exposure to MT output, presentation of source or reference translation, and unclear descriptions of the evaluation categories, among others.

“This is especially [problematic] when the goal is to obtain meaningful scores across language pairs,” according to a recent study by a multidisciplinary team from Meta AI that includes Daniel Licht, Cynthia Gao, Janice Lam, Francisco Guzmán, Mona Diab, and Philipp Koehn.

To address this challenge, the authors proposed a novel metric in their May 2022 paper, Consistent Human Evaluation of Machine Translation across Language Pairs. Called XSTS, it focuses on meaning (semantic) equivalence and on cross-lingual calibration, which enables more consistent assessment.

Adequacy Over Fluency

XSTS, a cross-lingual variant of STS (Semantic Textual Similarity), estimates the degree of similarity in meaning between the source sentence and the MT output. The researchers used a five-point scale, where 1 represents no semantic equivalence and 5 represents exact semantic equivalence.

The new metric emphasizes adequacy rather than fluency, mainly because assessing fluency is much more subjective. The study noted that this subjectivity leads to higher variability, and that preserving meaning is a pressing challenge in many low-resource language pairs.

The authors compared XSTS to Direct Assessment (i.e., the expression of a judgment on the quality of MT output using a continuous rating scale) as well as some variants of XSTS, such as Monolingual Semantic Textual Similarity (MSTS), Back-translated Monolingual Semantic Textual Similarity (BT+MSTS), and Post-Editing with critical errors (PE).

They found that “XSTS yields higher inter-annotator agreement compared [to] the more commonly used Direct Assessment.”

Cross-Lingual Consistency

“Even after providing evaluators with instruction and training, they still show a large degree of variance in how they apply scores to actual examples of machine translation output,” wrote the authors. “This is especially the case, when different language pairs are evaluated, which necessarily requires different evaluators assessing different output.”

To address this issue, the authors proposed using a calibration set that is common across all languages and consists of MT output and the corresponding reference translations. The sentence pairs of the calibration set should be carefully selected to cover a wide quality range, based on consistent assessments from previous evaluations. The scores from those earlier assessments can then be used as the “consensus quality score.”

Evaluators should assess this fixed calibration set in addition to the actual evaluation task. Then the average score each evaluator gives to the calibration set should be calculated.

According to the authors, “The goal of calibration is to adjust raw human evaluation scores so that they reflect meaningful assessment of the quality of the machine translation system for a given language pair.”

Because the calibration set is fixed, its quality is fixed too, so in principle every evaluator should give it the same average score. The difference between an evaluator’s average score on the calibration set and the consensus quality score can therefore be used to adjust that evaluator’s scores.

“If this evaluator-specific calibration score is too high, then we conclude that the evaluator is generally too lenient and their scores for the actual task need to be adjusted downward, and vice versa,” explained the authors.

For example, if the consensus quality score for the calibration set is 3.0 but an evaluator gave it an average score of 3.2, then 0.2 should be deducted from all of that evaluator’s scores for the actual evaluation task.
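The adjustment described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors’ implementation: the function names `calibration_offset` and `calibrate` are hypothetical, and clipping the adjusted scores back to the 1 to 5 range is an added assumption that the article does not mention.

```python
from statistics import mean


def calibration_offset(calibration_scores, consensus_score):
    """Evaluator's mean score on the calibration set minus the consensus score.

    A positive offset suggests a lenient evaluator, a negative one a strict evaluator.
    """
    return mean(calibration_scores) - consensus_score


def calibrate(task_scores, calibration_scores, consensus_score, lo=1.0, hi=5.0):
    """Shift an evaluator's task scores by their calibration offset.

    Assumption: adjusted scores are clipped to the [lo, hi] scale.
    """
    offset = calibration_offset(calibration_scores, consensus_score)
    return [min(hi, max(lo, score - offset)) for score in task_scores]


# The evaluator averaged 3.2 on a calibration set whose consensus score is 3.0,
# so 0.2 is deducted from each of their scores on the actual task.
adjusted = calibrate([4, 2, 5], [3, 4, 3, 3, 3], 3.0)
print([round(score, 2) for score in adjusted])  # [3.8, 1.8, 4.8]
```

A strict evaluator, whose calibration average falls below the consensus score, gets a negative offset, so the same subtraction raises their task scores.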

The authors concluded that the calibration leads to improved correlation of system scores to subjective expectations of quality based on linguistic and resource aspects, as well as to improved correlation with automatic scores, such as BLEU.