In October 2015, Slator reported on Spain’s massive investment in natural language processing (NLP) and machine translation (MT), worth over $100 million. Announced by no less than the Vice President of the Spanish Government, Soraya Sáenz de Santamaría, the strategic five-year plan begins with the government pouring $16 million into the development of underlying NLP and MT technologies.
To better understand the impact of the multimillion-dollar undertaking, Slator reached out to experts in the field. We talked to Tony O’Dowd, Founder and Chief Architect of machine translation software provider KantanMT; Dr. Roberto Navigli, associate professor at the Linguistic Computing Laboratory of the Sapienza University of Rome and the mind behind the multilingual encyclopedic dictionary and semantic network BabelNet; and Dr. Joss Moorkens, post-doctoral researcher at the ADAPT Centre in the School of Computing and a lecturer in Multimedia Translation at the School of Applied Language and Intercultural Studies at Dublin City University.
Spain decided to invest in NLP and MT because the country has several official languages and claims to have strong research capabilities in NLP. These were in fact cited as strengths in a SWOT analysis that justified the five-year plan. KantanMT’s O’Dowd said Spain is a leading contender in statistical MT research and NLP applications, noting that the field is a popular subject for PhD and Master’s students, and that Apertium, an open-source rule-based MT platform, is deeply rooted in the Spanish academic community.
Dr. Moorkens cited “University of Alicante (with spinouts like Prompsit), Polytechnic Universities of Valencia and Catalunya, and the University of the Basque Country” as examples of academic institutions carrying out NLP and MT research in Spain. Meanwhile, Dr. Navigli said he thinks other countries can also claim they are in a position to lead MT research, and said he believes they should make similar investments in NLP.
Yet Spain made the first bold move. “This investment has the potential of catapulting Spain into the very forefront of statistical MT research, rivalling that of Ireland and Germany,” O’Dowd said.
Dr. Moorkens and Dr. Navigli shared more tempered outlooks. “As the changes in EU policy forces those who rely on EU funding to diversify, it will certainly help NLP and MT companies in Spain,” Dr. Moorkens offered, while Dr. Navigli said that “this will depend on how funding will be implemented concretely and how companies and research bodies will contribute.” He added that the scale of language research will also affect the outcome.
But they all agreed on one thing: the investment will have several positive consequences. The investment will most likely lead to “improvement of all human-language interfaces,” Dr. Navigli said, including “services which rely on human language technologies.” Dr. Moorkens was more focused on the academic impact: “It’ll bring in expertise, pay for postgrad and postdoc researchers who can publish, create new tools to improve quality [of] MT, and help existing research centres survive now that the EU’s Horizon 2020’s focus has moved away from MT.” O’Dowd looked forward to the investment reducing the language barriers that hamper international business in Europe, which he said will be good for trade throughout the continent.
In fact, O’Dowd indirectly expressed intent to get involved with the initiative: “Any statistical MT company that is not exploring how to get involved in this initiative will lose out. An investment of this importance simply cannot be ignored – if you do – you do so at your peril!”
What Does It Mean to “Vastly Improve NLP and MT”?
Google Translate is undoubtedly the most popular example of a statistical MT tool today. And it is far from perfect. But what does it mean now that Spain is going to throw millions at NLP and MT research? What sort of improvements should we expect?
“I don’t foresee anything other than incremental improvements in quality by 2020,” Dr. Moorkens said. “We can work on domain specificity and improve quality, but the quality is still likely to vary greatly by language pair and domain.” He noted that high-quality MT without human post-editing need not be the be-all and end-all, saying integration in translation editor tools, among other areas, might also be explored. As for getting rid of the need for post-editing, Dr. Navigli said outright that it will be difficult to achieve.
O’Dowd was enthusiastic about seeing results, particularly about reaping the benefits of improved MT for international trade. “MT would be ubiquitous – today it’s the exception, it’ll be the norm in the future,” he said. “We’ll see improved data modelling, augmented statistical techniques that will ultimately lead to higher translation fidelity and quality.”
It appears the $100 million will not go to waste, but how exactly should it be spent?
According to Dr. Navigli, semantics is currently the most important area of NLP and MT. “While part-of-speech tagging and syntax are already solved with performances in the range of 80-90%, semantics still need to be integrated in MT and in many other NLP areas.” O’Dowd delved into how he thought the investment should prioritize the varied challenges it faces. “The first hurdle is the challenge of collecting, classifying, and aggregating suitable bi-lingual training sources,” he said. “One of the biggest challenges in building SMT engines is developing a stream of training data that has a high providence, especially for languages from the newer members of Europe.”
“Following on to this is the development of augmented data models that may incorporate POS [part-of-speech] information,” he continued. “Today’s SMT systems are constrained by data model limitations, enhancing these models will ensure higher model fidelity, better phrasal alignment and improved translation outputs.” Finally, he pointed out that post-editing will be the last hurdle.
“Today’s CAT tools are not suitable environments for high-speed post-editing. We need to rethink our approach, ditching the rules of the 20th Century.”
The second “pillar” of Spain’s strategy, after investing in underlying technologies, is knowledge transfer between academia and the corporate sector, which is the usual approach in the EU. O’Dowd said the typical partnership or consortium approach “has proven to yield the most practical and useful results,” citing CNGL (Centre for Next Generation Localisation) and ADAPT as examples. Because this approach keeps academic research grounded and relevant to industry needs, he said, “It’s a winning combination all of the time!”
Dr. Moorkens, who conducts his research at the CNGL and ADAPT, agreed. He said that while this approach may be unlikely to lead to fully automatic, high-quality machine translation, “it makes no sense to eschew the existing knowledge and experience,” especially as any breakthrough ideas from outside this approach would presumably be presented and published. He added that he would like to see the money go to existing Spanish research groups.
The Future after the $100 Million Investment
How seriously Spain takes this five-year plan is apparent in its choice of a leading government figure to present the strategy. $100 million is big money and a serious investment, reflecting MT’s increasingly central role in translation and localization. “MT clearly has its role in the future of localisation,” Dr. Moorkens said. He noted, however, that this does not mean MT is “the be-all and end-all for every stakeholder.”
“Globalisation has created a large and growing market for translation and localisation, with more people employed than ever before, but the roles within it are changing, pushed by continual optimisation of the supply chain and the industry reliance on technology and outsourcing,” he said. “Technology is part of the working life of most of those in translation and MT has a part to play in that.”
Meanwhile, Dr. Navigli sees this latest investment as a confirmation that “MT is definitely key in many areas, both when the user benefits directly from it and when systems that integrate MT improve their performance.” He said the EU has been investing considerable amounts in MT for some years now.
He also highlighted “the fact that NLP… is funded is simply great news,” considering that the EU is not going in a similar direction with Horizon 2020.
“Language is not a priority in this Framework Programme,” Dr. Navigli said.
O’Dowd emphasized the significance of the investment for international business: “For a single, integrated market, goods and services must move freely, quickly and smoothly. The language barrier creates friction in this – removing the friction will accelerate commerce, company engagement and interaction and improve Europe’s ability to translate worldwide.”