Language data is big business. The sub-industry, which supplies training corpora for language technologies ranging from natural language processing to machine translation, is enjoying a resurgence thanks to AI.
Nearly every language-related, AI-powered technology is driving demand: speech recognition, sentiment analysis, question answering, summarization and, of course, neural machine translation (NMT). Language data has always been necessary for technologies such as statistical MT, but NMT and other neural network-based solutions are even more data-hungry. What’s more, these technologies require high-quality, domain-specific language data to produce equally high-quality output.
The boom in language data has become so pronounced that companies like Appen had a “truly outstanding” 2017, breaking through a billion-dollar valuation. The Australia-headquartered, Australian Securities Exchange-listed company has two business lines: a Language Resources Division that provides datasets (audio, text, image and video) for training AI engines, and a Content Relevance Division that helps clients train AI-driven products (mainly search engines) via human evaluation and feedback. And the growth has not stopped for Appen: its first-half 2018 results saw its shares reach an all-time high.
KT Feels the Language Data Crunch
Yet the growth in language data also means that demand so far outstrips supply, as South Korea’s largest telecom company, KT Corp., has found.
In January 2017, KT launched GiGA Genie, an AI-powered voice assistant. According to Kang Da-Som, a Manager at KT Corp., the service had a million subscribers by July 2018, and the company intends to increase that to 1.5 million by the end of the year.
Unlike Google Assistant or Amazon Alexa, however, GiGA Genie focuses not on search or ecommerce but on media services. Indeed, GiGA Genie first made news by being installed in hotels. The service responds to commands in English and Korean, with more languages planned to follow, and offers smart concierge services to partner five-star hotels in South Korea.
According to Da-Som, the translation is performed via MT, but she did not provide additional details. She did offer, however, that “the biggest difficulty has been acquiring sufficient data on languages.”
The lack of high-quality language data is a major concern for the NMT research community. It is also a practical problem for the likes of Facebook and Google and, of course, for language service providers that are already engaged in, or intend to enter, the NMT race. The issue is even more acute for low-resource languages, where parallel training corpora are scarce.
Machine Learning in the Spotlight
Meanwhile, machine learning featured prominently in Lionbridge’s latest rebranding in September 2018. “The company is increasing investments in its rapidly growing Machine Intelligence division to provide innovative solutions to assist customers in moving into new global markets,” the company’s press release read.
Google also added to the machine learning buzz when it released AutoML Translation, essentially a machine learning feature that allows its cloud clients to use their own language data for customized training of Google Translate’s NMT engines.
This potentially marks another wave of demand growth, one that benefits language data providers like Appen and creates more clients that need such data, like KT Corp.