A new paper by two Microsoft researchers says that machine translation (MT) is largely stuck in a “decade-old sentence-level translation paradigm” — unable to take full advantage of document context.
According to authors Matt Post, Principal Researcher, and Marcin Junczys-Dowmunt, Principal Research Scientist at Microsoft, this issue has become increasingly evident given the competitive pressure from large language models (LLMs), which are document-based.
Although there have been many efforts to develop document-level machine translation, obstacles in architecture, training data, and evaluation have hindered progress.
In their April 25, 2023, paper, Post and Junczys-Dowmunt demonstrate that document-based MT works well with a standard Transformer architecture, provided it has enough capacity. They used Marian, a neural machine translation framework, to train all of the models used in their study.
They showed that training on document samples drawn from back-translated data alone is effective, and that parallel document data, taken as a whole, can even be harmful. They explain that back-translated data is “more readily available” and “of higher quality compared to parallel document data, which may contain MT output.”
The authors obtained monolingual data from the web that included document metadata. For parallel data, they used the CCMatrix dataset, which does not contain document information, along with parallel data crawled from the web that does carry document metadata.
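To make the idea of "document samples" concrete, here is a minimal Python sketch of how sentence pairs that carry a document ID might be grouped and packed into multi-sentence training samples. It is an illustration only, not the authors' Marian pipeline: the function name `build_document_samples`, the `<eos>` separator, the whitespace token count, and the `max_tokens` budget are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# Hypothetical sentence separator; the paper's actual format may differ.
SEP = " <eos> "

Row = Tuple[str, str, str]  # (doc_id, source_sentence, target_sentence)


def _join(chunk: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Concatenate the source sides and the target sides of a chunk."""
    return SEP.join(s for s, _ in chunk), SEP.join(t for _, t in chunk)


def build_document_samples(rows: Iterable[Row], max_tokens: int = 50) -> List[Tuple[str, str]]:
    """Group sentence pairs by document, then pack consecutive sentences
    into samples that stay within a rough whitespace-token budget."""
    by_doc = defaultdict(list)
    for doc_id, src, tgt in rows:
        by_doc[doc_id].append((src, tgt))

    samples = []
    for pairs in by_doc.values():
        chunk, length = [], 0
        for src, tgt in pairs:
            n = max(len(src.split()), len(tgt.split()))
            # Start a new sample when the budget would be exceeded.
            if chunk and length + n > max_tokens:
                samples.append(_join(chunk))
                chunk, length = [], 0
            chunk.append((src, tgt))
            length += n
        if chunk:
            samples.append(_join(chunk))
    return samples


rows = [
    ("a", "Hallo Welt", "Hello world"),
    ("a", "Wie geht es", "How are you"),
    ("b", "Danke", "Thanks"),
]
document_samples = build_document_samples(rows, max_tokens=50)
```

Data without document IDs, such as CCMatrix as described above, could not be grouped this way and would remain as single-sentence samples.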
Important Hurdle
The authors explained that “an important hurdle in the path to document-level translation is the difficulty of evaluation.” The dominant paradigm for evaluating long-tail document phenomena has been contrastive evaluation, in which a system is evaluated based on its ability to discriminate between correct and incorrect translation pairs.
They demonstrated that contrastive test sets do not discriminate between document-based systems, and they proposed generative variants of existing contrastive metrics that do.
“What is needed are metrics that directly evaluate a model’s generative capacity, rather than its discriminative ability,” the researchers said.
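The difference between the two evaluation styles can be sketched in a few lines of Python. The example below is a toy illustration under stated assumptions, not the metrics proposed in the paper: `toy_score` and `toy_translate` are hypothetical stand-ins for a real model, and the generative check simply looks for an expected pronoun in the system's own output.

```python
from typing import Callable, List, Tuple


def contrastive_accuracy(
    examples: List[Tuple[str, str, str]],
    score: Callable[[str, str], float],
) -> float:
    """Share of examples where the model scores the correct translation
    above the incorrect one. The model never has to produce anything."""
    hits = sum(score(src, good) > score(src, bad) for src, good, bad in examples)
    return hits / len(examples)


def generative_accuracy(
    examples: List[Tuple[str, str]],
    translate: Callable[[str], str],
) -> float:
    """Share of examples where the model's own output contains the
    expected word (here: a context-dependent pronoun)."""
    hits = sum(
        expected.lower() in translate(src).lower().split()
        for src, expected in examples
    )
    return hits / len(examples)


SOURCE = "The door was open. It was red."

# Hypothetical model scores for two candidate translations.
TOY_SCORES = {"Sie war rot.": -1.0, "Er war rot.": -2.0}


def toy_score(src: str, hyp: str) -> float:
    return TOY_SCORES[hyp]


def toy_translate(src: str) -> str:
    # The toy model ranks "Sie" above "Er" but actually generates "Es".
    return "Es war rot."


contrastive = contrastive_accuracy([(SOURCE, "Sie war rot.", "Er war rot.")], toy_score)
generative = generative_accuracy([(SOURCE, "sie")], toy_translate)
```

In this toy setup the stand-in model ranks the correct candidate higher, yet the translation it actually generates uses the wrong pronoun, which is the kind of gap a purely discriminative test cannot reveal.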
Samuel Läubli, Textshuttle CTO, and Rico Sennrich and Martin Volk, Professors at the University of Zurich, were the first to emphasize the need to shift to document-level evaluation as far back as 2018. More recently, efforts have been made to build automatic metrics that make use of context, such as BlonDe and Doc-COMET.
Starting Point
In their paper, Post and Junczys-Dowmunt identified three impediments to moving the field towards contextual translation, and provided workable starting-point solutions to all of them. They acknowledge that their approach is only “a starting point” and that many questions remain unanswered.
They also said that the fact that high-capacity Transformers work well does not rule out the possibility of further improvements through more sophisticated approaches. According to them, the biggest unresolved issue is how to build scalable, trustworthy document metrics.