As machine translation (MT) improves, quality assessment becomes more nuanced. The industry-standard BLEU metric has come under fire for biasing MT output, and researchers have recommended incorporating human feedback to improve MT evaluation.
Now, a June 2021 research paper has homed in on the need to evaluate terminology consistency in MT. Corpus- or sentence-level metrics, such as BLEU, measure general MT quality, but not whether terminology constraints are used, or used correctly.
“While professional translators very commonly use terminologies and translation memories, most academic research on MT does not utilize them,” co-principal investigator Antonios Anastasopoulos told Slator. “This research direction will, hopefully, close the gap between academic translation researchers and the real-world practices of professional translators.”
Anastasopoulos, an assistant professor, and PhD candidate Md Mahfuz ibn Alam, both from George Mason University, collaborated with researchers from several institutions on the paper, On the Evaluation of Machine Translation for Terminology Consistency: Laurent Besacier, Matthias Gallé, and Vassilina Nikoulina from Naver Labs, Grenoble, and, from Facebook, James Cross and Philipp Koehn, who is also affiliated with Johns Hopkins University.
The team used Covid-19 terminologies, which provided readily available parallel data, to experiment with five languages across several MT systems.
A common paradigm for controlling MT output is constrained decoding, in which the beam search must adhere to hard constraints. As an alternative to this brittle and often computationally expensive method, the researchers provided terminological constraints as MT input, in the form of annotations inlined into source sentences. Unlike hard constraints, these “soft” constraints do not guarantee that the MT output will include the specified terminology.
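As a rough sketch of what such inline annotations might look like (the tag tokens below are hypothetical; actual systems define their own tagging schemes during training):

```python
# Toy sketch of "soft" terminology constraints supplied as inline
# annotations. The <term>/<trans>/</term> tokens are hypothetical; real
# systems define their own tagging schemes and train the model on them.

def annotate_source(source: str, terminology: dict) -> str:
    """Inline the desired target term next to each constrained source term."""
    for src_term, tgt_term in terminology.items():
        source = source.replace(
            src_term, f"<term> {src_term} <trans> {tgt_term} </term>"
        )
    return source

# English -> French, with one Covid-19 terminology entry
src = "Please wear a face mask indoors."
print(annotate_source(src, {"face mask": "masque"}))
# Please wear a <term> face mask <trans> masque </term> indoors.
```

The model sees the desired target term but remains free to inflect or reorder it, which is what distinguishes this approach from hard constrained decoding.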
The team’s new metric, TERm, is a modification of TER, an edit distance-based metric. For a holistic evaluation, the researchers complemented TERm with both exact-match accuracy (whether a specified term appears verbatim in the MT output) and window overlap (whether the words surrounding a term in the output match those surrounding it in the reference).
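A minimal sketch of the two complementary checks, under simplifying assumptions (whitespace tokenization, single-token terms occurring once); the code released with the paper implements the official versions:

```python
# Minimal sketch of the two complementary term metrics, assuming simple
# whitespace tokenization and one occurrence per single-token term. The
# official implementations released with the paper handle the general case.

def exact_match_accuracy(hypotheses, term_lists):
    """Fraction of desired target terms appearing verbatim in MT output."""
    hits = total = 0
    for hyp, terms in zip(hypotheses, term_lists):
        for term in terms:
            total += 1
            hits += term in hyp.split()
    return hits / total if total else 0.0

def window_overlap(hyp, ref, term, w=2):
    """Overlap of the w-token windows around `term` in hypothesis and reference."""
    def window(tokens):
        i = tokens.index(term)  # assumes the term is present exactly once
        return tokens[max(0, i - w):i] + tokens[i + 1:i + 1 + w]
    hyp_win, ref_win = window(hyp.split()), window(ref.split())
    return len(set(hyp_win) & set(ref_win)) / max(len(ref_win), 1)

hyp = "le vaccin contre la grippe est efficace"
ref = "le vaccin contre la grippe reste efficace"
print(window_overlap(hyp, ref, "grippe"))  # 0.75: 3 of 4 window tokens match
```

Exact match alone can be gamed by inserting a term anywhere; window overlap checks that the term also sits in the right context, which is why the paper pairs them.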
‘Serious Translation Providers’ May Already Have This
Anastasopoulos commented: “I suspect that serious translation providers that have to adhere to high quality standards might already have similar metrics as part of their quality assurance process.”
The first step was creating terminology annotations over the TICO-19 benchmark, a dataset Google and Facebook used to create Covid-19 terminologies in more than 100 languages in two weeks. Professional translators then verified the terminology matches over the parallel data.
Testing the metric on TICO-19 systems, the researchers found that TERm was more likely than BLEU to significantly penalize “cheated” systems, in which an untranslated target was appended to the end of the output. In a post-hoc analysis of several WMT20 systems, the team identified the best-performing systems for terminology and compared them with the overall highest-quality systems.
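To see why an edit-distance metric is harder to game this way, one can compare plain TER (standing in here for the term-focused TERm) with BLEU on a toy “cheated” output; the sacrebleu library implements both:

```python
# Toy comparison of BLEU and TER on a "cheated" output that appends
# target-side text verbatim. Plain TER stands in for the term-focused
# TERm, and the setup is simplified from the paper's.
# Requires: pip install sacrebleu
from sacrebleu.metrics import BLEU, TER

refs = [["the vaccine protects against the virus"]]
honest = ["the vaccine guards against the virus"]
cheated = [honest[0] + " " + refs[0][0]]  # append the target untranslated

bleu, ter = BLEU(), TER()
for name, hyps in (("honest", honest), ("cheated", cheated)):
    print(f"{name}: BLEU={bleu.corpus_score(hyps, refs).score:.1f} "
          f"TER={ter.corpus_score(hyps, refs).score:.1f}")
# TER (lower is better) counts the appended text as pure insertions,
# while BLEU's n-gram precision barely suffers, or even improves.
```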
“While one would expect that the best overall translation would also be the best in terms of terminology translation, there is only a mild correlation between BLEU scores and the term exact-match accuracy scores,” the authors wrote.
On the other hand, they also noted a high correlation between the term exact-match accuracy and the window overlap scores. “[This] implies that, encouragingly, a MT system that is good at handling terminology is also generally good at correctly placing the terms in the appropriate context.”
A group of human annotators working in English-Russian and English-French confirmed that the combination of metrics was able to distinguish correct term translations from incorrect or missing term translations.
The team has open-sourced the code for computing the metrics and released training data for five language pairs for a WMT21 shared task, in which participants will develop MT systems whose term consistency will be evaluated by the new metrics.
“There are several approaches for incorporating terminological constraints into an automatic translation system, so we hope the shared task will help us figure out which ones work better and whether we still need to do more research in this direction,” Anastasopoulos said. “Our next focus will be dealing with term translation in morphologically-rich languages.”