April 23, 2019
The State of Neural Machine Translation for Asian Languages
From being a fringe academic research topic in 2014 to the de facto industry standard in 2018, neural machine translation (NMT) has been a landscape-changing trend, especially in the last two years.
There are also academic research roundups that shine a light on trends like corporates accelerating NMT research or the occasional noise-making news out of China, such as Sogou leveraging “language-centric AI” for growth.
The coverage, however, may not fully do justice to NMT research, development, and deployment for Asian languages. Researchers face unique challenges in Asian language NMT because many languages are low-resource, while several key ones are written in non-Roman scripts. For providers, unlocking high-quality NMT in these languages opens up new market demand in some of the world’s fastest-growing emerging economies.
The International Conference on Asian Language Processing (IALP), held in Bandung, Indonesia, from November 15 to 18, 2018, was neither the first nor the only conference of its kind in the region. But what happened during the event can give a general impression of the state of Asian language NMT. Six papers presented during the event directly tackled NMT.
Slator reached out to two experts in the field. The first was Wuying Liu, an author of a research paper presented at IALP 2018 who also organized the track called “Machine Translation & Language Resources.” The second was Lucy Park, a research scientist working on machine translation at Naver. South Korea’s largest search engine with over 70% market share, Naver has a translation app with 10 million users. What follows are highlights of Slator’s conversation with them.
Slator: You spoke on Language Resource Extension for Indonesian-Chinese Machine Translation at IALP 2018. What are your next steps after this research?
Wuying Liu: We plan to build machine translation systems from eight languages of the 10 ASEAN countries into Chinese. We would also like to combine machine translation with particular fields for major languages, such as English-Chinese machine translation in the field of patents. We hope our efforts can contribute to NLP (natural language processing) as well as serve trade and cultural communication.
Slator: Given that many languages you focus on are low-resource, how do you compensate for the lack of training data?
Liu: The sources of our corpus are diverse. Teachers and students who study low-resource languages give us a lot of support. In constructing a corpus, NLP techniques are used to align the texts, combined with proper manual intervention. Building a well-structured bilingual corpus is one of the key parts of our research. As technology advances, mature and powerful code is not a problem. Instead, large-scale corpus resources are the bottleneck for many scholars.
Slator: Can you comment on the state of language resources in machine translation for SEA languages? Where do researchers and companies get training data?
Liu: To be honest, the data for SEA languages is fairly scarce. Take Filipino as an example. We can hardly find any bilingual sentence-aligned corpus. So it is quite meaningful to build corpora for languages like that, and we are still working hard on this. One possible solution is a comparable corpus: a collection of text pairs that are in different languages and are not translations of each other, but that share similar content or a similar topic.
Slator: Which techniques do you use at training time to alleviate the low-resource constraints of these languages?
Liu: A high-quality corpus is our first choice. In addition to that, transfer learning among cognate languages is an effective way to alleviate the constraints. We have also noticed that recent research on GANs (generative adversarial networks) and other unsupervised learning methods is very popular, and these methods make the neural network less dependent on large-scale training data.
Slator: How would you characterize the state of NLP and NMT for Asian languages? How mature is the research compared to the West?
Liu: Research on NLP and NMT for Asian languages is ascendant and promising. It is still in the stage of resource accumulation due to its features. Most Western languages are cognate languages and their letters are basically Latinized. Organizations like the EU play an important role in the field of language resources and translation. However, compared to the West, Asian languages have different sources and the spectrum is diverse. The characters of Asian languages are complicated. The lack of corpus also causes difficulties.
Lucy Park: Most of my experience is focused on Korean and machine translation, so my replies will tend to be biased. Since most of the recent approaches for machine translation are language-independent, we can say there is not much of a difference in terms of usage of models. However, for MT, as many academic papers are benchmarked on major language pairs such as EN-DE or EN-FR, we always have to check whether a given paper's method is effective with Asian languages. I believe that is also the case for other NLP tasks.
To foster active research in NLP for Asian languages, I believe that areas where language-dependent knowledge is frequently required need to be explored and discussed more. As many Asian languages have their own distinctive writing systems and grammar, methodologies for tasks, such as data acquisition, preprocessing, and evaluation, should be appropriately adapted from that of major Western languages.
In order to do so, we would need more open data that is usable for research and free from copyright. To take POS (part of speech) tagging for the Korean language as an example, the only publicly available dataset is the one created from 1998–2007, namely the Sejong Corpus. It was created during a government-funded project. For MT, parallel data between Asian languages — and, more generally, not to or into English — are very scarce.
Open source for Asian language NLP is getting more and more active, but it would be useful to have more projects that are both frequently updated and popular. Sometimes, code licensing plays a negative role, because many old projects are GPL (General Public License). Jieba, Rakuten MA, KoNLPy are some frequently-used libraries for CJK (Chinese-Japanese-Korean) NLP. (Lucy Park is a KoNLPy developer.)
Slator: We observed an uptick around Chinese-Japanese-Korean in research papers submitted to arXiv since last year. Has there recently been more interest in other Asian languages as well?
Liu: Thanks to the macro-environment in Asian countries, there is an increasing demand for inter-communication, which promotes research on these languages. However, most research is conducted by scholars in countries with higher economic levels. In addition to CJK languages, there is growing attention to Southeast Asian languages, such as Indonesian, Malay, and Vietnamese.
Park: WAT (Workshop on Asian Translation) has recently been including minor Asian languages, and that might be good evidence that research efforts are spreading.
Slator: What do you regard as the most notable or promising NLP and NMT research areas for Asian languages?
Liu: With economic development [comes] more business and trade exchanges between Asian countries. Applications in trade, e-commerce, and technology are popular and promising.
Park: Low resource NLP/MT: for MT, non-English to non-English translation is particularly difficult.
NLP with non-Latin character sets: As lexical pattern matching still plays an important role in modern NLP, using different character sets with different characteristics alters the usage of models.
- Chinese and Japanese don’t put spaces between words. Korean does, but its morphemes are agglutinated, which makes tokenization an important task.
- The number of characters is huge: Korean (Hangul) has 11,172 possible syllable blocks, whereas the English alphabet has 26 letters.
- Chinese and Japanese use logographs. Korean syllables also consist of jamo. These are all decomposable units.
Overcoming cultural or language differences: for example, the subject in a sentence is frequently omitted in Korean. To translate Korean to English, we need to infer the subject in the sentence. [Additionally], there are honorifics in the Korean language, which reflect the relationship between the speaker and the audience. Regarding this difference, Papago recently released a feature that translates English to an honorific form of Korean.
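Park's point about Hangul syllables being decomposable can be illustrated with a short sketch. It is not taken from the interview or from any Naver tooling. It relies only on the Unicode Hangul syllable scheme (19 leading consonants, 21 vowels, and 28 final slots including "no final") and Python's standard `unicodedata` module. The function names are hypothetical and chosen for illustration.

```python
import unicodedata

# Constants from the Unicode Hangul syllable composition scheme.
S_BASE = 0xAC00          # code point of the first syllable block, "가"
L_COUNT = 19             # leading consonants
V_COUNT = 21             # vowels
T_COUNT = 28             # trailing consonants, including "no final"


def count_syllable_blocks():
    """Number of precomposed Hangul syllable blocks."""
    return L_COUNT * V_COUNT * T_COUNT


def decompose(syllable):
    """Return (lead, vowel, tail) indices for one Hangul syllable block."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < count_syllable_blocks():
        raise ValueError("not a precomposed Hangul syllable: %r" % syllable)
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, tail = divmod(rest, T_COUNT)
    return lead, vowel, tail


def to_jamo(text):
    """Split syllable blocks into conjoining jamo via Unicode NFD."""
    return unicodedata.normalize("NFD", text)


if __name__ == "__main__":
    print(count_syllable_blocks())
    print(decompose("한"))
    print([hex(ord(ch)) for ch in to_jamo("한글")])
```

The product 19 × 21 × 28 is where the figure of 11,172 comes from. A tokenizer or subword model can operate on either the syllable blocks or the jamo sequence, and which level works better for a given MT system is an empirical question this sketch does not address.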
Slator: China’s been very busy with companies like Sogou, Tencent, and Alibaba very active in research and events. Which notable companies, organizations, and academic institutions have you noticed as being most active in the research space and making headway in NLP and NMT in other Asian languages?
Liu: In terms of academic institutions, Tsinghua University, China University of Science and Technology, Suzhou University, Northeastern University, and Harbin Institute of Technology have been conducting research and also making great progress on NLP and NMT.
To be honest, industrial applications for Asian languages are still few. But there are several outstanding companies. BAT (Baidu, Alibaba, and Tencent) are three competitive Chinese companies in IT, including NLP and NMT. Youdao and NiuTrans are also making progress in Asian languages.
Park: In Korea, Naver has also been sponsoring many major events, and publishing papers recently. In Japan, Rakuten IT. In India, IIT Bombay and Microsoft India.
Slator: Do you know of any noteworthy applications of NLP and NMT technology for Asian languages?
Liu: Some applications for Western languages are also applicable for Asian languages. For example, translation systems, information retrieval, opinion analysis, and so forth, are quite common now.
Park: Products and services are quite adapted to Asian culture, usages, and language. Recently, Rakuten and Baidu’s showcases were very interesting. Papago, Naver’s machine translation service, focuses on translation between Korean and English, and other Asian languages. Clova is the AI platform developed by Naver and LINE, and is available in Korea and Japan. The Clova team has released some open source projects recently.
Slator would like to thank Professor Kyunghyun Cho for his assistance, as well as Naver’s Zaemyung Kim and Vassilina Nikoulina for helping prepare Lucy Park’s responses.