Basic Data Augmentation Beats LLMs in Boosting Low-Resource Language Performance

IndiText Boost for Low-Resource Languages

Onkar Litake, Niraj Yagnik, and Shreyas Labhsetwar from the University of California, San Diego, demonstrated in a January 23, 2024 paper that basic data augmentation techniques are more effective than large language model (LLM)-based augmentation at improving model performance on text classification tasks.

The authors compared various data augmentation techniques for text classification, including easy data augmentation (EDA), back-translation, paraphrasing using LLMs, text generation using LLMs, and text expansion using LLMs, across six Indian languages: Hindi, Telugu, Marathi, Gujarati, Sindhi, and Sanskrit. For each language, they applied the augmentation techniques to two tasks: i) binary classification and ii) multi-class text classification.
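
As an illustration of one of these techniques, back-translation creates a paraphrased variant of a sentence by translating it into a pivot language and back. The sketch below is a minimal example assuming publicly available Hugging Face MarianMT checkpoints for Hindi–English; the models and settings actually used in the paper may differ.

```python
# Back-translation sketch: Hindi -> English -> Hindi yields a paraphrase of the
# original sentence that can be added to the training set.
# The checkpoint names are illustrative assumptions, not the paper's exact models.
from transformers import pipeline

hi_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-hi-en")
en_to_hi = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")

def back_translate(sentence: str) -> str:
    """Return an augmented version of `sentence` via a round trip through English."""
    english = hi_to_en(sentence)[0]["translation_text"]
    return en_to_hi(english)[0]["translation_text"]

# Example: augment a single Hindi training sentence.
augmented = back_translate("यह फिल्म बहुत अच्छी थी।")
print(augmented)
```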

As the authors explained, the main motivation for this work was the lack of research on data augmentation for Indian languages, despite its potential to enhance natural language processing (NLP) tasks such as news classification, hate detection, emotion analysis, sentiment analysis, and spam classification.

Minimal Attention

They noted that while extensive work has been done on data augmentation for the English language, minimal attention has been given to Indian languages, even though data augmentation is often employed to overcome challenges related to data scarcity in low-resource language settings.

The authors fine-tuned a pre-trained BERT model for each language and task on the augmented datasets and compared its performance against a baseline trained without augmentation. They underscored that “no such work exists for text augmentation in Indian languages.”
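
A minimal sketch of this setup, assuming the Hugging Face transformers and datasets libraries, a multilingual BERT checkpoint, and illustrative hyperparameters (the paper's exact configuration may differ):

```python
# Fine-tuning sketch: multilingual BERT on an augmented binary-classification set.
# Checkpoint, data, and hyperparameters below are assumptions for illustration.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

checkpoint = "bert-base-multilingual-cased"  # assumed multilingual BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# `texts` would hold the original plus augmented sentences, `labels` their class ids.
texts = ["मौसम आज सुहावना है।", "यह सेवा बहुत खराब थी।"]
labels = [1, 0]
train_set = Dataset.from_dict({"text": texts, "label": labels}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=128)
)

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train_set).train()
```

The same loop would be repeated per language and per task, once on the baseline data and once on each augmented dataset, so that the resulting classifiers can be compared directly.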

Basic Data Augmentation Techniques Surpass LLMs

The results showed that the augmentation methods consistently outperformed the baseline models on both binary and multi-class classification tasks across all six languages, underscoring the efficacy of data augmentation in enhancing model performance for low-resource Indian languages.

Among the methods, basic data augmentation techniques outperformed the LLM-based approaches. Specifically, EDA emerged as a clear winner, delivering consistent gains across all languages. Surprisingly, random deletion also performed well, despite seemingly removing information from each sentence.
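
For reference, the word-level operations behind EDA are straightforward to implement. The sketch below shows only random swap and random deletion; it is a simplified illustration rather than the authors' code, and synonym replacement or random insertion would additionally require language-specific word embeddings or a synonym resource.

```python
# Simplified EDA-style operations: random swap and random deletion of words.
import random

def random_swap(words: list[str], n_swaps: int = 1) -> list[str]:
    """Swap the positions of two randomly chosen words, n_swaps times."""
    words = words.copy()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_delete(words: list[str], p: float = 0.1) -> list[str]:
    """Drop each word independently with probability p, keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

tokens = "यह फिल्म बहुत अच्छी थी".split()
print(" ".join(random_swap(tokens)))
print(" ".join(random_delete(tokens, p=0.2)))
```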

The authors acknowledged the limitations of their work, noting that its scope was restricted to a specific set of languages because word embeddings are unavailable for most other Indian languages. They also expressed their intention to explore more augmentation techniques in future work.