Researchers Create a Large Language Model for Indonesia’s Languages

AI researchers Louis Owen, Vishesh Tripathi, Abhay Kumar, and Biddwan Ahmed, who work for customer service tech company Yellow AI, published a paper in March 2024 describing their experience with the Komodo-7B-Instruct large language model (LLM).

The Komodo-7B-Instruct model was built on the Llama-2 LLM. Interestingly, in 2023 Meta deemed the Llama-2 LLM potentially unsuitable for non-English use. The researchers claim that the Komodo LLM improves upon language translation services and “contributes to addressing educational disparities in Indonesia, offering direct translations from English to 11 regional languages.”

The model was designed specifically for Acehnese, Balinese, Banjarese, Buginese, Dayak Ngaju, Javanese, Lampungnese, Madurese, Minangkabau, Sundanese, and Toba Batak, as well as various dialects. It has seven billion parameters, as the 7B in its name indicates.

In the paper, the researchers explained that, with this model, they also sought to address known issues in other high-resource and multilingual LLMs, including English bias and underperformance in low-resource languages.

Grades 1-12 Textbooks as Data Sources

The datasets used in training and fine-tuning the Komodo-7B-Instruct LLM were created from open-source data and manually collected data. Sources included Indonesian textbooks on various subjects, colloquial data from movie subtitles, news, and informal conversations, according to the paper. 

Explaining that “a judicious selection of high-quality data has proven effective, even yielding State-of-the-Art performance under certain circumstances,” the researchers set out to create a model specialized in understanding. The resulting datasets addressed specific language traits, including language proficiency, cross-lingual understanding, common sense reasoning, sentiment analysis, and intent classification.

The vocabulary used was expanded to include common Indonesian and regional words. The researchers identified and incorporated approximately 2,000 frequently used words in Indonesian and 1,000 words for regional languages not included in the Llama-2 model.
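
As a rough illustration of what vocabulary expansion involves, the sketch below appends words that are missing from an existing vocabulary and gives them new token IDs. The toy base vocabulary, the candidate words, and the `expand_vocab` helper are hypothetical and are not taken from the paper, which this article does not quote on tooling.

```python
def expand_vocab(vocab, candidates):
    """Return a copy of vocab with unseen candidate words appended.

    New words receive the next free integer ID, so existing IDs stay unchanged.
    """
    expanded = dict(vocab)
    next_id = max(expanded.values(), default=-1) + 1
    for word in candidates:
        if word not in expanded:
            expanded[word] = next_id
            next_id += 1
    return expanded


# Toy base vocabulary (hypothetical; the real Llama-2 tokenizer has 32,000 entries).
base_vocab = {"<unk>": 0, "the": 1, "and": 2}

# Hypothetical candidates: Indonesian words plus one word that is already known.
new_words = ["terima", "kasih", "the", "selamat"]

expanded = expand_vocab(base_vocab, new_words)
added = len(expanded) - len(base_vocab)
print(f"added {added} new entries")
```

In a real model, each new token ID would also need a corresponding new row in the embedding matrix, which is then learned during further training.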

During the pre-training phase, Komodo-7B-Instruct refined how it represents words, placing similar words closer together in its internal representation. Data preparation steps included repetition removal (removing excessive repetition of words or phrases), quality filtering (filtering out low-quality or irrelevant data), and deduplication (removing duplicate entries).
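
The three cleaning steps named above can be sketched in a few lines. This is a minimal illustration under assumed rules: the thresholds (`max_repeats`, `min_words`, `min_alpha_ratio`), the helper names, and the sample sentences are all hypothetical, and the paper's actual filters may differ.

```python
def collapse_repetition(text, max_repeats=2):
    """Keep at most max_repeats consecutive copies of the same word."""
    out = []
    run = 0
    for word in text.split():
        if out and word.lower() == out[-1].lower():
            run += 1
        else:
            run = 1
        if run <= max_repeats:
            out.append(word)
    return " ".join(out)


def passes_quality(text, min_words=3, min_alpha_ratio=0.6):
    """Assumed heuristic: enough words, and mostly alphabetic characters."""
    if len(text.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def clean_corpus(lines):
    """Apply repetition removal, quality filtering, then exact deduplication."""
    seen = set()
    cleaned = []
    for line in lines:
        text = collapse_repetition(line.strip())
        if not passes_quality(text):
            continue
        key = text.lower()  # case-insensitive duplicate check
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned


raw = [
    "Saya suka suka suka suka membaca buku",  # excessive repetition
    "saya suka suka membaca buku",            # duplicate after cleaning
    "$$$ 12345 !!!",                          # low quality: no letters
    "ok",                                     # low quality: too short
    "Terima kasih banyak atas bantuannya",
]
cleaned = clean_corpus(raw)
print(cleaned)
```

Exact-match deduplication is the simplest choice for a sketch; production pipelines often use fuzzy matching as well, which is not shown here.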

Part of the model’s training also involved English datasets and alternate parallel data with all combinations of English, Indonesian, and the 11 regional languages. The researchers’ intention in doing so was to enhance the model’s understanding of code-mixed (multiple-language) sentences. They also used a bilingual next-token prediction strategy instead of monolingual next-token prediction with translated Indonesian text.
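
To give a sense of scale, the sketch below counts the language pairings that English, Indonesian, and the 11 regional languages produce. Whether the paper treats translation directions as ordered or unordered is not stated in this article, so both counts are shown. The `format_bilingual_example` helper and its tag layout are hypothetical, intended only to show how a source and target sentence could sit in one training sequence.

```python
from itertools import combinations, permutations

REGIONAL = [
    "Acehnese", "Balinese", "Banjarese", "Buginese", "Dayak Ngaju",
    "Javanese", "Lampungnese", "Madurese", "Minangkabau", "Sundanese",
    "Toba Batak",
]
LANGUAGES = ["English", "Indonesian"] + REGIONAL  # 13 languages in total

# Ordered pairs count A->B and B->A separately; unordered pairs do not.
directed = list(permutations(LANGUAGES, 2))
undirected = list(combinations(LANGUAGES, 2))


def format_bilingual_example(src_lang, src_text, tgt_lang, tgt_text):
    """Hypothetical layout: both sentences share one sequence, so that
    next-token prediction runs across the language boundary."""
    return f"[{src_lang}] {src_text}\n[{tgt_lang}] {tgt_text}"


example = format_bilingual_example(
    "Indonesian", "Selamat pagi", "English", "Good morning"
)
print(len(directed), len(undirected))
print(example)
```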

Better Performance Across Tasks

According to the researchers, their Komodo LLM surpasses various multilingual models, including Cohere’s Aya-101, MBZUAI’s Bactrian-X-llama-7B, Qwen-1.5, Mistral’s Mixtral-8x7B-Instruct-v0.1, and AISingapore’s Indonesian SEA-LION LLM on multiple tasks against existing benchmarks, including perplexity. It also surpasses Google Translate in scope, since the latter supports only Indonesian, Javanese, and Sundanese.
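
For readers unfamiliar with perplexity, it is the exponential of the average negative log-probability a model assigns to each token, and lower values are better. The sketch below computes it from a list of per-token log-probabilities. The input values are invented for illustration and are not results from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean(log p)) over natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Invented example: a model that gives every token probability 0.25 is
# as uncertain as a uniform choice among four options.
uniform_guess = [math.log(0.25)] * 8
ppl = perplexity(uniform_guess)
print(round(ppl, 6))
```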

The model, say the researchers, excelled in intent classification, colloquial language detection, sentiment analysis across languages, and cross-language understanding (e.g., Indonesian-English). Komodo-7B-Base was also able to maintain the performance of Llama-2-7B-Base across all tasks, except GSM8K, a math task.

The Komodo LLM was successfully designed and fine-tuned for “linguistic variations specific to the Indonesian context and its regional languages, enabling it to outperform in tasks related to Indonesian and regional languages,” added the researchers.

Beyond commercial applications, one important use case for the model is its potential role in supporting a diverse set of Indonesia’s regional languages for educational purposes, according to the researchers. Their idea is that with the Komodo LLM “resources and information can be more widely disseminated, contributing to a more inclusive and equitable educational landscape throughout the country.”