In what one expert describes as an important and significant event for both academia and the language industry, the United Nations released its official Parallel Corpus in May 2016. It is made up of manually translated documents from 1990 to 2014 in the UN’s six official languages: Arabic, English, Spanish, French, Russian, and Chinese. While previous parallel corpora did exist, the UN says the 2016 corpus is the first one “published by the original data creator.” The corpus is vast, consisting of nearly 800,000 documents or slightly over 1.7m aligned document pairs.
Previous corpora include one published in 2009 made up of 2,100 UN General Assembly Resolutions. Another, MultiUN, released in 2010 and modified in 2012, was published by EuroMatrixPlus, the EUR 6m project of the EU Information Society Technology program.
Says John Tinsley, “MultiUN has been used extensively in R&D. However, this one (2016 version) is significantly bigger and more up-to-date in terms of the UN documents used.”
Tinsley is CEO and co-Founder of Iconic Translation Machines. He tells us Dublin-based Iconic used the MultiUN corpus along with other non-UN related data in its machine translation (MT) software. Tinsley’s team had previously worked with the World Intellectual Property Organization (WIPO) on other projects, particularly with 2016 corpus authors Marcin Junczys-Dowmunt and Bruno Pouliquen. WIPO developed what he describes as an “extensive patent MT solution.”
Tinsley points out that, since WIPO is a UN organization, “they realized they could reappropriate their tool for more general UN content; a byproduct of this was the dataset they created, which was cleaned and prepared as we see now.”
Some of the languages in this release, namely Russian and Arabic, are languages for which data is not so widely available—John Tinsley, Iconic Translation Machines co-Founder
Meanwhile, Olivier Debeugny, Founder of Lingua Custodia, says they would still “need to perform thorough analysis of how clean the data is, and specifically check if there is a useful indexation of segments” for use in his company’s specialized translation engines.
Although Debeugny’s France-based financial translation automation firm has, in the past, used the UN corpus, they regarded it as “too generalist for what we aim to produce.” He concurs with Tinsley, however, about the size of the latest version, saying, “It seems, at first sight, that the amount of data available is multiplied by two in comparison to what was available before.”
Clean Training Sets Are Crucial
Tony O’Dowd, Founder & Chief Architect of SaaS-based machine translation platform KantanMT, calls the 2016 UN Parallel Corpus “important and significant, both from an academic and industry point of view.”
O’Dowd explains: “From an academic point of view, high quality, clean training data sets are central to the ongoing research and development of higher quality data cleansers, data modelling techniques, evaluation of new re-ordering methods and, of course, neural MT systems.”
He adds that access to high-quality training data improves the ability to benchmark engine improvements that result from better research.
From an industry viewpoint, on the other hand, gaining easy access to the UN Corpus “will inevitably lead to better translation outputs for most high quality MT platforms,” O’Dowd says. This, in turn, will lead to broader SMT (statistical machine translation) usage.
About the quality of the corpus, the KantanMT founder says that while the number of language pairs is relatively small, they are triangulated, which he sees as “very helpful.”
Another beneficial aspect, he says, is that the UN published BLEU scores for each dataset, which O’Dowd points out will be helpful in repurposing the data for larger engines as well as offering “a good benchmark for the relative cleanliness of the datasets.”
BLEU (bilingual evaluation understudy) scores rate, in layman’s terms, how close a machine translation is to a human translation. The highest BLEU scores for the 2016 corpus go to English-Spanish in the fully aligned subcorpus (i.e., sentences aligned across all languages with English primary documents), as well as Spanish from and into English across the entire corpus.
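For readers who want to see what “how close” means in practice, here is a minimal sketch of a simplified, single-reference, sentence-level BLEU calculation. It is illustrative only and is not the scoring setup used by the UN corpus authors: published BLEU figures are normally computed over a whole test set with standard tooling and reported on a 0 to 100 scale. The function names `ngrams` and `simple_bleu` are hypothetical, and the sketch assumes whitespace tokenization and no smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU in [0, 1] against one reference.

    Assumptions: whitespace tokenization, uniform n-gram weights,
    and no smoothing (any n-gram order with zero matches gives 0.0).
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: a candidate n-gram is credited at most as many
        # times as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


human = "the cat sat on the mat"
machine = "the cat sat on the red mat"
print(simple_bleu(machine, human))
```

A machine output identical to the human reference scores 1.0, while an output sharing no words with it scores 0.0. Most real translations land somewhere in between.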
O’Dowd says KantanMT will soon include the 2016 corpus in its library.
Filling a Language Data Gap
While the lowest scores go to the pairs to and from Chinese, Arabic, and Russian, the 2016 UN corpus is valuable to technology providers specifically because of such languages, notes Iconic CEO Tinsley.
“These are languages for which data is not so widely available, unlike French and Spanish, for instance,” he says. Tinsley calls training data a crucial starting point for all MT engines.
He cites Russian and Arabic as being of particular value and says Iconic already uses the new corpus in “special similarity measures” for comparing client projects to the UN data, the point being to extract the most relevant segments to improve translation quality.
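Tinsley does not describe how Iconic’s similarity measures work, so the following is only a generic sketch of the underlying idea: score each corpus segment against a sample of client text and keep the closest matches. It assumes a simple bag-of-words cosine similarity, which is a stand-in technique and not Iconic’s method. The function names and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, whitespace-tokenized word counts for a piece of text."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_relevant(corpus_segments, client_text, top_k=2):
    """Return up to top_k corpus segments most similar to the client text.

    Segments with no word overlap at all (score 0) are dropped.
    """
    profile = bag_of_words(client_text)
    scored = [(cosine_similarity(bag_of_words(seg), profile), seg)
              for seg in corpus_segments]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for score, seg in scored[:top_k] if score > 0]


# Made-up example data, for illustration only.
segments = [
    "The Security Council adopted the resolution",
    "Patent claims describe the invention",
    "The weather was pleasant",
]
client_sample = "patent application claims for a new invention"
print(select_relevant(segments, client_sample, top_k=1))
```

In a production setting the scoring would more likely rely on language-model or embedding-based measures, but the selection step (rank, then keep the top matches) follows the same pattern.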
Tinsley admits the corpus is not directly applicable in the same way for other language service providers, but points out that it is still valuable as “supplementary training data to improve vocabulary coverage and language models.”
As Lingua Custodia’s Debeugny puts it, any free access to a new parallel corpus is very useful to the MT industry. He adds, “What becomes more important now as there are more [parallel corpora] available is to have various segments indexed according to their domain and an estimation of their usability for MT engine training.”
All three experts agree that, as KantanMT’s O’Dowd says, “Successful SMT development starts with good training datasets. While the techniques used to build phrase-based systems are noise-tolerant, the better the training data, the better the quality of the eventual (MT) engine.”
He concludes, “This presents interesting possibilities for the MT industry.”
At the end of their supplementary paper, the UN publishers state they hope to publish updated versions of the 2016 UN Parallel Corpus, extending its coverage backward from 1990 and forward from 2014.
Inline image and reference: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.