September 15, 2016
How Neural Machine Translation Can Unlock Europe’s Digital Single Market
Europe is a multilingual continent and society. The European Union consists of 28 member states with a total of 24 official languages. Additionally, there are multiple regional and minority languages as well as immigrant languages in active use in the continent.
META-NET, a Network of Excellence consisting of 60 research centers from 34 countries, is dedicated to building the technological foundations of a multilingual European information society.
Since its inception in 2010, META-NET has argued that Europe’s multilingual digital society can benefit immensely from sophisticated language technologies, especially multilingual ones. Such technologies bridge language barriers and thereby support multilingualism, inclusion, and cross-border and cross-cultural communication and commerce.
The Digital Single Market
With regard to current European politics, the transition to an integrated and truly connected Digital Single Market (DSM) has been one of the main strategic goals of the European Commission (EC) for about two years now—the EC predicts up to EUR 400bn in economic growth by 2020.
Measures like eliminating roaming charges, improving legislation (especially copyright and data protection), and making cross-border payments easier are among the important and necessary preconditions. However, they are not sufficient to accomplish the overall goal.
If customers are hampered by language, online commerce will remain confined to fragmented markets, defined and restricted by language silos. Even the unacceptable suggestion for everyone to use English would not deliver a single market, since less than 50% of the EU’s population speaks English, and less than 10% of non-native speakers are proficient enough to use English for online commerce.
Approximately 60% of individuals in non-Anglophone countries seldom or never make online purchases from English-language sites; the number willing to purchase from sites in non-native languages other than English is much, much lower.
As a result, no single language can address 20% or more of the DSM (German comes closest, as the native language of 19% of the EU’s population). Taking care of the top four EU languages (German, French, Italian, English) would still address only half the EU citizens in their native language. Even allowing for second-language speakers, no single language can address more than a fraction of the DSM.
Concentrating exclusively on the 24 official EU languages would exclude those European citizens from the DSM who speak regional or minority languages, languages of important trade partners, or languages of refugees.
Small and medium-sized European companies (SMEs) are a vital component of the DSM. However, only 15% of European SMEs sell online—and of that 15%, fewer than half do so across borders.
SMEs that sell internationally exhibit 7% job growth and 26% innovate in their offering (compared to a job growth of 1% and 8% innovation for SMEs that do not).
Only if Europe accepts the multilingual challenge and decides to design and to implement research and innovation-driven technology solutions, as well as a service infrastructure with the goal of overcoming language barriers, can the full economic benefits of the DSM be achieved.
Enabling European SMEs to use language technologies easily, so that they can grow their business online across many languages, is key to boosting their innovation potential and to helping them create jobs.
How to Overcome Language Barriers?
The borders between our languages are invisible barriers, at least as strong in their separating power as any remaining regulatory boundaries. They create fragmented and isolated digital markets with no bridges to other languages or markets, thereby hampering the free flow of products, commerce, communication, ideas, help, and thought.
Language barriers in the online world can only be overcome by (1) significantly improving one’s own skills in non-native languages, (2) making use of others’ language skills, or (3) using digital technologies. With the 24 official EU languages and dozens of additional languages, relying on the first two options alone is neither realistic nor feasible.
For specific types of content and purposes, specialized human language services, increasingly assisted by language technology themselves, will continue to play a major role in translating documents, creating subtitles for videos, or localizing websites into 20+ other languages.
However, relying on human services alone would exclude most SMEs because of the high costs. It would create a market that can only be successfully penetrated by large, consolidated enterprises, which is why cost-effective methods must be found to support market access for SMEs and European citizens.
To succeed, any SME must both excel in communicating its expertise in its market niche and be able to engage in two-way conversations with its customers online. The free machine translation services offered by a few tech giants are useful for giving users the gist of web content. But they cannot be easily and cheaply tailored to support the niche communication needs between SMEs and their customers.
Supplementing this with domain-tailored language services such as content and sentiment analysis, knowledge extraction, and multimodal online engagement is completely out of reach for SMEs aiming to engage the half of the EU consumers who do not enjoy English, German, French or Italian as their native language.
The connected and truly integrated DSM can only exist once all language barriers have been overcome and all languages are connected through technologies. Only advanced communication and information technologies that are able to process and to translate spoken and written language in a fast, robust, reliable, and ubiquitous way, producing high-quality output, can be a viable long-term solution for overcoming language barriers.
Language Technologies for the Multilingual DSM
Establishing such a multilingual digital communication infrastructure requires a big collective push that involves designing, implementing and deploying technologies, services and platforms, accelerating innovation, research and efficient technology transfer.
Only a few of our languages are in a moderate-to-good state with regard to support through digital technologies. More than 70% of our languages are seriously under-resourced and face the danger of digital extinction (e.g., Maltese and Lithuanian), although support for these languages with smaller numbers of speakers is slowly improving.
Language technology is clearly the missing piece of the puzzle that will bring us closer to a fully integrated DSM. It is the key enabler to boosting growth in Europe and strengthening our competitiveness in the IT sector, which is getting increasingly critical for Europe’s future.
The DSM holds tremendous potential to transform the European economy and make it more globally competitive. All languages actively spoken in Europe are also used digitally: e-commerce shops, information pages, online services, encyclopedias, university pages, company websites, user-generated content, online videos, podcasts, radio stations, and other multimedia content all make use of the official, regional, and unofficial minority languages spoken in Europe.
These languages must also be covered and reflected by the DSM. To realize this, we suggest putting in place applications, platforms, and services based on language technologies. We recommend setting up the highly focused three-year Multilingual Value Program (MLV) to enable the Multilingual Digital Single Market.
A new Strategic Research and Innovation Agenda, launched on July 4-5, 2016 at META-FORUM 2016 in Lisbon, Portugal, describes the MLV Program in greater detail. Not only do we have to take into account systems that combine language technologies and big data analytics, we also have to intensify our research efforts in machine translation (MT) in order to boost the quality of MT.
Using neural networks for MT was one of the key topics at META-FORUM 2016 because it is the prime candidate for delivering the needed quality boost.
Neural Machine Translation as Key Future Research Area
Neural machine translation recently achieved breakthroughs in machine translation research, and has established itself as the new state of the art. It represents a fundamental shift from earlier statistical models, the most popular being phrase-based statistical machine translation.
Phrase-based translation systems are a combination of various models that are trained independently, and that make strong simplifying modeling assumptions. In contrast, the neural MT model is a single artificial neural network, a machine learning method inspired by biological neural networks.
Artificial neural networks consist of neurons, whose activation for a given input is expressed as a numerical value, and of weighted connections between the neurons along which the activations pass. Neural networks are a powerful tool to approximate arbitrary functions, and are commonly used in computer vision, speech recognition, and natural language processing.
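To make the notions of activation and weighted connection concrete, here is a minimal sketch in plain Python. The weights, the biases, and the choice of tanh as the activation function are illustrative assumptions, not values from any real system; in practice the weights are learned from data.

```python
import math

def neuron(inputs, weights, bias):
    """Activation of one neuron: weighted sum of its inputs, squashed by tanh."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return math.tanh(total)

def layer(inputs, weight_rows, biases):
    """A layer is several neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical weights for a network with two inputs, two hidden neurons,
# and one output neuron.
hidden = layer([1.0, -2.0], [[0.5, 0.25], [-1.0, 0.0]], [0.0, 2.0])
output = neuron(hidden, [1.0, 1.0], 0.0)
print(hidden, output)
```

Each number in `hidden` is one neuron's activation; the output neuron in turn receives those activations along its own weighted connections.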
Currently, the most widespread network architecture for neural MT consists of two main components: an encoder, which reads the source sentence word by word, and represents it as a series of numbers; and a decoder, which produces the target sentence word by word, each prediction conditioned on the numerical representation of the source sentence, and on a numerical representation of the previously produced target words.
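The encoder-decoder idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the architecture of any particular system: the vocabularies, the dimension, the random untrained weights, and the use of a simple average of the encoder states as the source representation are all hypothetical simplifications, so the output words are meaningless. The point is only the flow of information.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
DIM = 4
SRC_VOCAB = ["das", "haus", "ist", "klein"]
TGT_VOCAB = ["<eos>", "the", "house", "is", "small"]

def rand_vector(n):
    return [rng.uniform(-0.5, 0.5) for _ in range(n)]

def rand_matrix(rows, cols):
    return [rand_vector(cols) for _ in range(rows)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def add(*vecs):
    return [sum(xs) for xs in zip(*vecs)]

def tanh_vec(v):
    return [math.tanh(x) for x in v]

def softmax(v):
    top = max(v)
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

# Untrained, randomly initialised parameters (hypothetical).
src_emb = {w: rand_vector(DIM) for w in SRC_VOCAB}
tgt_emb = {w: rand_vector(DIM) for w in TGT_VOCAB}
W_enc, U_enc = rand_matrix(DIM, DIM), rand_matrix(DIM, DIM)
W_dec, U_dec, C_dec = (rand_matrix(DIM, DIM) for _ in range(3))
W_out = rand_matrix(len(TGT_VOCAB), DIM)

def encode(words):
    """Read the source word by word; return one vector of numbers per word."""
    state, states = [0.0] * DIM, []
    for w in words:
        state = tanh_vec(add(matvec(W_enc, src_emb[w]), matvec(U_enc, state)))
        states.append(state)
    return states

def decode(states, max_len=10):
    """Produce target words one at a time, conditioned on the source
    representation and on the previously produced target word."""
    context = [sum(col) / len(states) for col in zip(*states)]
    state, prev, out = [0.0] * DIM, "<eos>", []  # "<eos>" doubles as start symbol
    for _ in range(max_len):
        state = tanh_vec(add(matvec(W_dec, tgt_emb[prev]),
                             matvec(U_dec, state),
                             matvec(C_dec, context)))
        probs = softmax(matvec(W_out, state))
        prev = TGT_VOCAB[probs.index(max(probs))]  # greedy choice
        if prev == "<eos>":
            break
        out.append(prev)
    return out

states = encode(["das", "haus", "ist", "klein"])
translation = decode(states)
print(translation)
```

Training, described next, is what turns such random parameters into ones that yield useful translations.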
Both the encoder and decoder are learned jointly by incrementally moving the network parameters, namely the weights of the connections between the neurons, in the direction that increases the probability of the training data. As in previous data-driven MT approaches, the training data consists of source language sentences that are paired with a human translation in the target language.
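The training principle can be illustrated with a deliberately tiny model. The sketch below assumes a hypothetical setting in which each source word id is mapped to a distribution over target word ids by one row of weights; the data, the learning rate, and the function names are invented for illustration. Real systems apply the same idea, gradient steps that raise the log-probability of the training pairs, to all parameters of the encoder and decoder at once.

```python
import math

def softmax(v):
    top = max(v)
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood(weights, data):
    """Log-probability the model assigns to the training pairs."""
    return sum(math.log(softmax(weights[x])[y]) for x, y in data)

def train_step(weights, data, lr=0.5):
    """One pass of gradient ascent: nudge the weights so that each
    observed target word becomes more probable given its source word."""
    for x, y in data:
        probs = softmax(weights[x])
        for k in range(len(probs)):
            # Gradient of log p(y | x) with respect to the k-th score.
            grad = (1.0 if k == y else 0.0) - probs[k]
            weights[x][k] += lr * grad

# Hypothetical training pairs (source word id, target word id).
data = [(0, 1), (1, 2), (0, 1)]
weights = [[0.0] * 3 for _ in range(2)]

before = log_likelihood(weights, data)
for _ in range(20):
    train_step(weights, data)
after = log_likelihood(weights, data)
print(before, after)
```

After the updates the log-likelihood is higher than at the start, which is what "moving the parameters in the direction that increases the probability of the training data" means in practice.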
In practice, the most salient difference between the translations of phrase-based systems and neural systems is fluency; that is, how grammatical and natural the translation output looks.
Phrase-based models are good at producing locally fluent translations, but because of strong independence assumptions, have no good mechanisms to ensure that the sentence is globally grammatical.
In contrast, information can be passed over long distances through the networks of neural machine translation, and the models have been shown to produce more globally fluent output. Agreement phenomena in morphologically rich languages are prime examples, and strong improvements in word order have also been observed.
As a result, neural machine translation has been judged favorably by various metrics compared to phrase-based models: higher similarity to human translations in automatic evaluation, higher rankings by human judgments, and reduced post-editing effort. The fast pace of improvement in the last year of research in neural MT gives rise to the expectation that the gap between neural MT and other approaches will widen in the near future.
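Automatic evaluation of this kind typically measures how many word sequences a system output shares with a human reference translation. The following sketch computes a clipped n-gram precision; it is a simplified illustration of the general idea rather than any specific published metric, and the example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all runs of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hypothesis, reference, n):
    """Share of the hypothesis n-grams that also occur in the reference,
    counting each reference n-gram at most as often as it appears there."""
    hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total

reference = "the house is small".split()
fluent = "the house is small".split()
scrambled = "small the is house".split()

print(clipped_precision(fluent, reference, 2))     # all bigrams match
print(clipped_precision(scrambled, reference, 1))  # same words
print(clipped_precision(scrambled, reference, 2))  # but no shared bigrams
```

The scrambled output contains the right words yet shares no two-word sequence with the reference, which is why longer n-grams make such scores sensitive to word order and local fluency.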
Despite these recent breakthroughs, machine translation is far from being a solved problem. Fluency has improved, but is by no means perfect. More importantly, there is still no guarantee that the translation preserves the meaning of the source sentence; such errors are all the more striking when the translation is fluent, because fluent output gives the reader little reason to doubt it. Sentences are still translated out of context, and any coherence beyond the sentence level is by luck rather than design.
On the other hand, neural models also open up new possibilities that were not feasible before. Parts of the network can be shared between different language pairs, and a universal translator—a single network that can translate in any direction among a high number of languages—may soon move from the realm of science fiction into reality.
We see neural MT as one of the most important areas that European LT and MT research needs to concentrate on in the next five to ten years, not only to establish the Multilingual Digital Single Market but also to reduce communication barriers on a global scale.
About the Authors
Georg Rehm is a researcher and project leader in the Language Technology lab at the German Research Centre for Artificial Intelligence (DFKI GmbH) in Berlin. He is the General Secretary of META-NET and Coordinator of the EU project CRACKER, which has initiated the Cracking the Language Barrier federation of organizations and projects working on technologies for a multilingual Europe.
Rico Sennrich is a post-doctoral researcher in the Machine Translation group at the School of Informatics, University of Edinburgh. Among his main research interests are neural machine translation and models and algorithms for syntax-based statistical machine translation.
Jan Hajič is a professor of computational linguistics and former director of the Institute of Formal and Applied Linguistics at Charles University in Prague. He specializes in empirical NLP, machine translation, and treebanks. Since 2015, he has been serving as Chairman of META-NET.
Image: Jan Hajič speaking at the META-FORUM 2016