India’s linguistic landscape is characterized by a diverse array of languages, with 22 scheduled languages recognized in the Constitution. These languages, including Assamese, Bengali, Bodo, Dogri, Konkani, Gujarati, Hindi, Kannada, Kashmiri, Maithili, Malayalam, Marathi, Manipuri, Nepali, Odia, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu, and Urdu, are spoken by approximately 97% of the population.
While English often serves as a common medium of communication, high-quality machine translation (MT) systems for Indian languages are crucial for effective communication, social inclusion, equitable access, and national integrity.
However, progress in open, accessible MT for Indian languages has been slow compared to global advancements in the field.
A recent collaborative effort involving researchers from the Nilekani Centre at AI4Bharat, IIT Madras, Microsoft, EkStep Foundation, National Institute of Information and Communications Technology in Kyoto, Japan, and the Institute for Infocomm Research (I2R) in Singapore has shed light on the existing limitations in Indic MT models.
As the researchers explained in a paper published on June 17, 2023, there are some MT models that “either do not have a good coverage of Indian languages, or their performance on Indian languages is poor, or both.”
The primary reason for these limitations is the absence of parallel training data spanning all 22 languages. Evaluation has also been hindered, particularly across diverse domains and for content of Indian origin: “There are no robust benchmarks designed explicitly for Indian languages,” the researchers said.
To address this gap, the researchers' recent work contributes to “wide, easy, and open access to good MT systems for all 22 scheduled Indian languages.”
Their work focuses on four key areas: curating and creating larger training datasets, creating diverse benchmarks, training multilingual models, and releasing models with open access. According to the researchers, these contributions “pave the way for advancements in Indic MT” and offer promising prospects for improving machine translation in India.
Training and Evaluation
One crucial aspect of improving MT is the availability of comprehensive training datasets. To that end, the researchers released the Bharat Parallel Corpus Collection (BPCC), the largest publicly available collection of parallel corpora for Indic languages. BPCC comprises 230 million bitext pairs, of which 126 million are newly added; the new pairs include 644,000 manually translated sentence pairs. The BPCC dataset serves as a resource for training and refining MT models, with the aim of more accurate and linguistically nuanced translations.
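To put the BPCC figures in proportion, the arithmetic can be worked through in a few lines of Python. This is only a sketch built on the three numbers quoted above; the function name and the dictionary keys are illustrative, and it assumes that the 644,000 manually translated pairs are a subset of the 126 million newly added pairs and that the remaining pairs predate this work.

```python
# Figures quoted in the article, in bitext pairs.
TOTAL_PAIRS = 230_000_000
NEW_PAIRS = 126_000_000
MANUAL_PAIRS = 644_000


def bpcc_breakdown(total=TOTAL_PAIRS, new=NEW_PAIRS, manual=MANUAL_PAIRS):
    """Derive simple shares from the quoted BPCC totals.

    Assumption (not stated numerically in the article): the manually
    translated pairs are counted inside the newly added pairs.
    """
    return {
        # Pairs that were not newly added by this work.
        "existing_pairs": total - new,
        # Newly added pairs as a percentage of the whole collection.
        "new_share_pct": round(100 * new / total, 1),
        # Manually translated pairs as a percentage of the new pairs.
        "manual_share_of_new_pct": round(100 * manual / new, 2),
        # Manually translated pairs as a percentage of the whole collection.
        "manual_share_of_total_pct": round(100 * manual / total, 2),
    }


if __name__ == "__main__":
    for key, value in bpcc_breakdown().items():
        print(f"{key}: {value}")
```

Under these assumptions, roughly 104 million pairs were not newly added, the new pairs make up a little over half of the collection, and the manually translated portion is well under 1% of the total.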
Besides training data, the absence of robust benchmarks designed specifically for Indian languages has been a significant hurdle in advancing MT in India. To address this limitation, the researchers created IN22, the first parallel benchmark covering all 22 scheduled Indian languages.
This benchmark spans diverse domains such as news, entertainment, culture, legal, and India-centric topics, and features Indian-origin content and source-original test sets. IN22 allows translation models to be evaluated and compared specifically in the Indian context. According to the authors, IN22 and BPCC are “first-of-their-kind evaluation and training corpora covering 22 Indic languages.”
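As an illustration of how a multi-domain benchmark can be used, the sketch below scores system outputs against reference translations and averages the scores per domain. The article does not name the automatic metrics used with IN22, so the metric here is an assumption: a simplified character n-gram F-score in the spirit of chrF, not the benchmark's official scoring. The function names, the toy English sentences, and the domain labels are all hypothetical.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Count character n-grams of length n, ignoring spaces."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def simple_chrf(hypothesis, reference, max_n=3, beta=2.0):
    """Simplified character n-gram F-score between 0.0 and 1.0.

    Precision and recall are averaged over n-gram orders 1..max_n and
    combined with an F-beta score (beta=2 weights recall more heavily).
    This is a stand-in for illustration, not a reference implementation.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped matching n-grams
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def score_by_domain(examples):
    """Average the score per domain for (domain, hypothesis, reference) rows."""
    buckets = defaultdict(list)
    for domain, hypothesis, reference in examples:
        buckets[domain].append(simple_chrf(hypothesis, reference))
    return {domain: sum(v) / len(v) for domain, v in buckets.items()}


if __name__ == "__main__":
    # Hypothetical toy data; a real benchmark would supply these rows.
    toy_examples = [
        ("news", "the minister spoke today", "the minister spoke today"),
        ("news", "rain expected tomorrow", "rain is expected tomorrow"),
        ("legal", "the court rejects appeal", "the court dismissed the appeal"),
    ]
    for domain, score in score_by_domain(toy_examples).items():
        print(f"{domain}: {score:.3f}")
```

Reporting scores per domain rather than as a single average makes it easier to see whether a system that does well on news text also holds up on legal or culturally specific content.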
A Breakthrough in Indic Machine Translation
The headline result is IndicTrans2, the first MT model capable of supporting all 22 scheduled Indian languages. According to the researchers, IndicTrans2 surpasses publicly available open and commercial models on multiple benchmarks, both existing and new, in both automatic and human evaluation. More specifically, the findings indicate that IndicTrans2 outperforms Google, NLLB 54B, and GPT-3.5, while performing comparably to Azure.
To encourage widespread use and further research in Indic MT, the researchers have released their models and associated data on GitHub under permissive licenses. This open-access approach also fosters collaboration among researchers, developers, and language enthusiasts.
“We believe our work addresses the pressing need for high-quality and accessible models and makes significant contributions towards that,” said the researchers. Moreover, they hope their work will “inspire continued research, collaboration, and open access initiatives in Indic machine translation, empowering individuals and organizations to communicate seamlessly across linguistic boundaries.”
Authors: Jay Gala, Pranjal A. Chitale, Raghavan AK, Sumanth Doddapaneni, Varun Gumma, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M. Khapra, Raj Dabre, and Anoop Kunchukuttan