Document-level machine translation (MT) has received growing attention in the MT community in recent years. Despite various advances in modeling, however, there is still no efficient and effective evaluation metric for document-level translation.
“Although currently an active community is working on developing document-level MT systems, their evaluation has primarily been performed at the sentence level,” wrote Sheila Castilho, Assistant Professor at Dublin City University, in 2020. Already in 2018, Samuel Läubli, Textshuttle CTO, and Rico Sennrich and Martin Volk, Professors at the University of Zurich, had emphasized the need to shift to document-level evaluation.
From Sentence to Document Level
As reported in a 2021 study by Microsoft, there are two types of automatic MT evaluation metrics:
- Standard string-based metrics (e.g., BLEU, TER, chrF)
- Current state-of-the-art pretrained metrics (e.g., BERTScore, BLEURT, COMET, Prism)
The latter leverage existing language models or sequence-to-sequence models to determine whether a hypothesis (i.e., the raw MT system output) conveys the same meaning as a reference translation (i.e., a high-quality translation produced by a professional translator).
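For illustration, a pretrained metric like BERTScore can be run in a few lines with the open-source `bert-score` package. A minimal sketch, with invented example sentences:

```python
# Minimal sketch: scoring an MT hypothesis against a human reference with
# BERTScore, one of the pretrained metrics listed above. Requires the
# open-source `bert-score` package; the sentences are invented examples.
from bert_score import score

hypotheses = ["The ministers met in Brussels on Monday."]
references = ["The ministers gathered in Brussels on Monday."]

# Precision, recall, and F1 are derived from cosine similarities between
# the contextual token embeddings of each hypothesis-reference pair.
P, R, F1 = score(hypotheses, references, lang="en")
print(f"BERTScore F1: {F1.item():.4f}")
```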
However, none of these metrics are suitable for document-level MT evaluation, as they focus on sentence-level translation quality and ignore discourse-level aspects. More precisely, they can neither distinguish document-level from sentence-level improvements in translation quality nor identify the discourse phenomena (such as anaphoric reference, lexical cohesion, coherence, deixis, and ellipsis) that context-agnostic translations get wrong, according to a 2022 study by Jiang, Liu, et al.
The same study introduced a novel automatic metric called “BlonDe” (Bilingual Evaluation of Document Translation) to expand the scope of automatic MT evaluation from the sentence to the document level. (The study did not compare BlonDe to pretrained metrics.)
Amazon Study
In September 2022, Amazon researchers presented a simple method for extending pretrained metrics to incorporate document-level context, and applied it to BERTScore, Prism, COMET, and COMET-QE, the reference-free (quality estimation as a metric) version of COMET.
As in standard sentence-level metrics, the authors compared a single hypothesis sentence to a single human reference sentence to obtain a score. The difference: they included additional context, namely the two previous sentences from the reference translation, when computing the contextual embeddings for both the hypothesis and the reference sentence.
Once the hypothesis and reference sentence had been embedded, the researchers discarded the embeddings of the extra context sentences and computed metric scores following the same process as the corresponding sentence-level metric.
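A minimal sketch of this idea for a BERTScore-style greedy-matching metric follows. The model choice (`roberta-base`), helper names, and example sentences are assumptions for illustration, not the authors' implementation:

```python
# Hypothetical sketch of context-aware embedding for a BERTScore-style
# metric: prepend reference context before encoding, then drop the
# context tokens so that only the sentence's own (now context-aware)
# embeddings enter the score.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

def embed_with_context(sentence: str, context: str) -> torch.Tensor:
    """Embed `sentence` with `context` prepended; return only the
    contextual embeddings of the sentence's own tokens."""
    n_ctx = len(tokenizer(context, add_special_tokens=False)["input_ids"])
    enc = tokenizer(context + " " + sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    # Drop the leading special token, the context tokens, and the
    # trailing special token; keep the sentence tokens only.
    return hidden[1 + n_ctx : -1]

def doc_level_f1(hyp: str, ref: str, ref_context: str) -> float:
    """BERTScore-style greedy matching over context-aware embeddings."""
    h = torch.nn.functional.normalize(embed_with_context(hyp, ref_context), dim=-1)
    r = torch.nn.functional.normalize(embed_with_context(ref, ref_context), dim=-1)
    sim = h @ r.T                             # pairwise cosine similarities
    precision = sim.max(dim=1).values.mean()  # best match per hypothesis token
    recall = sim.max(dim=0).values.mean()     # best match per reference token
    return (2 * precision * recall / (precision + recall)).item()

# Context = the two previous reference sentences, as described above.
ctx = "The delegates arrived late. They had missed the first session."
print(doc_level_f1("They joined the debate anyway.",
                   "Nevertheless, they joined the debate.",
                   ctx))
```

Encoding the sentence jointly with its context lets ambiguous tokens (pronouns, elided words) receive context-aware representations, while discarding the context embeddings afterward keeps the score directly comparable to the sentence-level metric.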
The Amazon researchers also measured system-level correlation with human judgments to test the effectiveness of the proposed document-level metrics.
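System-level correlation is typically computed by averaging a metric's segment scores for each MT system and correlating those averages with human judgments of the same systems. A toy sketch with invented numbers, using Pearson's r as one common choice:

```python
# Toy sketch of system-level correlation: one averaged metric score and
# one human score per MT system (all numbers invented for illustration).
from scipy.stats import pearsonr

metric_scores = [0.71, 0.65, 0.80, 0.58]  # avg. metric score per system
human_scores = [72.4, 66.1, 79.8, 61.0]   # avg. human judgment per system

r, p = pearsonr(metric_scores, human_scores)
print(f"System-level Pearson correlation: {r:.3f} (p = {p:.3f})")
```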
Novel Ways for Document-Level Evaluation
The findings demonstrate improved correlation with human judgments when document-level context is added to pretrained metrics. These improvements are most likely due to better exploitation of context.
In addition, the document-level metrics outperformed their sentence-level counterparts in around 85% of the tested scenarios (excluding results on low-quality human references). As for the document-level extension of COMET-QE specifically, the proposed method significantly improved its accuracy on discourse phenomena tasks, outperforming a dedicated baseline by up to 6.1%.
“A simple extension of the metrics permits them to take advantage of context,” wrote the team.
In particular, the authors observed improvements in the evaluation of pronoun translation, not only when the relevant information appears in a previous sentence but also when it appears in the same sentence, indicating that additional context can be helpful even then. Beyond pronoun translation, the approach also outperformed both the sentence-level metric and the dedicated document-level baseline at word-sense disambiguation.
Thus, the Amazon researchers concluded that, “to the best of our knowledge, our work is the first example of pretrained document-level MT metrics […] We believe that it could easily be extended to other pretrained sentence-level metrics.”
The MT community should adopt metrics that take document-level context into account, according to the team, who also suggest that “any future research in metrics should explore novel ways to incorporate context.”