We spoke to Casper Grathwohl, president of language content and data provider Oxford Languages (a division of Oxford University Press), for his thoughts on the current and future state of the Indian languages data market, and how his business is empowering people and programmes in the development of digital India.
Where do you see opportunities and challenges for the Indian languages data market?
India has the world’s second-largest online population and it’s incredible to watch how fast it’s growing—536 million Indian language speakers are expected to be active online by 2021, more than double the number online in 2016. At Oxford Languages we’re excited by the potential and motivated by the challenge of enabling this massive, multilingual community to access and engage with online content in their own languages.
Despite the fact that Indian language speakers are expected to account for nearly three quarters of India’s internet user base by 2021, research by KPMG and Google undertaken a few years ago revealed that 60% of those surveyed cited limited language support and content as the largest barrier to their adoption of online services. That’s the challenge—there isn’t enough broad, clean, basic data available in some of these languages to effectively incorporate them into the everyday digital experiences we take for granted in resource-rich languages like English, Chinese, and Spanish.
This is especially problematic as Hindi-speaking internet users are predicted to surpass the number of English-speaking internet users in India, while digital trends forecasting anticipates that Bengali, Marathi, Telugu, and Tamil speakers will form 30% of India’s total internet user base by next year.
Enabling these internet users to access content in their own language is fundamental to transforming their online experience, but to do that the basic building blocks need to be put into place: high-quality, domain-specific corpora; scrubbed wordlists containing rich lexical information; and structured multilingual dictionary and thesaurus entries. At Oxford Languages this is where we excel and where we are investing a huge amount of effort: creating the lexical building blocks that serve as a foundation for more advanced language experiences.
How is Oxford Languages transforming Indian internet users’ online experience?
Meeting the demand for digital content in Indian internet users’ native languages is a challenge that our team has made a priority over the last five years. We are committed to developing clean, structured lexical content for low-resourced languages, ensuring that speakers of these languages benefit from access to the digital world in their own tongue and experience full digital representation as communication technologies evolve.
To this end, Oxford Languages currently offers comprehensive datasets in 12 major Indian languages, including Gujarati, Hindi, Marathi, Tamil, and Urdu, all of which are available to license as standalone datasets and to access through our self-service API, putting Oxford’s quality data at developers’ fingertips. These datasets include wordlists, bilingual lexicons, dictionary entries, morphological analysers, and more.
Our language datasets are trusted by start-ups and big tech players alike because they are clean, structured, and flexible, and they support a wide range of applications across India’s top digital trends, from digital payments to frontier technologies such as AI training, machine translation, voice recognition, and text-to-speech.
For all of these use cases, authoritative, structured data is an essential foundation and we are working hard to create new content in the major Indian languages.
What is Oxford Languages doing that sets it apart from other LSPs in this space?
About two years ago we launched a dedicated Indian Languages Programme, which focuses exclusively on the creation of new lexical content in the major Indian languages—with a particular emphasis on the growing demand for bilingual/bidirectional language data and curated corpora.
With a team headquartered in Noida, just outside Delhi, the programme builds on 150 years of Oxford University Press’s experience in researching, writing, acquiring, curating, and delivering world-class dictionary content in languages from Arabic to Zulu. With this experience in our corner, we have established content creation hubs in India to bring the expertise of local translators and lexical consultants directly into the development of our Indian language content. Combined with the tried-and-tested framework developed by our lexicographers and specialists in Oxford, this local expertise has enabled us to quickly create comprehensive bilingual content that fills the market gap for quality Indian language data to power developers’ programs and projects around the world.
The latest output of this ambitious programme, for example, is a comprehensive update to our English-Hindi bilingual dictionary data, which, just like our Hindi-English dataset, is available to license and access through our API. Expect further Hindi updates, new content from our Tamil content creation hub, new Indian language hubs, and new products like parallel corpora in the pipeline as we expand our offering to meet user needs.
What Indian language data solutions does Oxford Languages provide?
Whether you require an off-the-shelf Gujarati dataset or a wordlist tailored to domain specifications, we have business development representatives based in India and at Oxford Languages branches around the world available to learn about your unique use cases and explore how Oxford Languages can provide the language data solution that suits your project needs.
If you are interested in finding out more about our Indian Languages Programme, register your interest and a member of the Oxford Languages team will be in touch.