Large Language Models Beat Commercial MT Models for Arabic Dialects, Research Finds


In a research paper published on October 23, 2023, researchers from the University of British Columbia and Abu Dhabi’s Mohamed bin Zayed University of Artificial Intelligence demonstrated that large language models (LLMs) perform well in translating Arabic dialects into English.

As the researchers note, Arabic comprises a diverse array of languages spoken by approximately 450 million individuals throughout the Arab world. 

This linguistic framework encompasses a wide spectrum of varieties influenced by temporal factors (e.g., historical vs. contemporary forms), spatial considerations (e.g., country-level distinctions), and sociopragmatic functions (e.g., standardized usage in government communication versus informal street language).

One blind spot in research, the team stated, has been how well LLMs translate Arabic varieties into other languages. To explore this, they assessed the performance of ChatGPT (both GPT-3.5-turbo and GPT-4) and Google’s Bard when translating Arabic varieties into English, and compared it against commercial machine translation (MT) systems such as Google Translate and Microsoft Translator.

The evaluation covered ten diverse Arabic varieties, including Classical Arabic (CA), Modern Standard Arabic (MSA), and several country-level dialects, scored with automatic metrics such as BLEU, METEOR, and TER.
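To make the metrics concrete: BLEU rewards n-gram overlap between a system translation and a reference, with a brevity penalty for overly short outputs. Below is a minimal, illustrative sketch of unsmoothed sentence-level BLEU in plain Python; it is not the study's evaluation pipeline (which would typically use a standard toolkit such as sacreBLEU), just a sketch of the idea.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed sentence-level BLEU (up to 4-grams) with brevity penalty.

    Illustrative only: returns 0.0 whenever any n-gram precision is zero,
    which real toolkits avoid via smoothing.
    """
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_ngrams.values())
        if total == 0:
            return 0.0  # hypothesis too short to contain any n-grams
        # Clip counts so repeated words cannot inflate precision.
        clipped = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
        if clipped == 0:
            return 0.0
        precisions.append(clipped / total)
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0, and scores fall as n-gram overlap with the reference drops; METEOR and TER capture complementary signals (synonym-aware matching and edit distance, respectively).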

To assess the LLMs’ capability on genuinely unseen data, the researchers manually curated a multi-dialectal Arabic dataset for MT evaluation. This ensured a robust evaluation environment and shed light on LLMs’ performance on novel and previously untouched datasets.

Cultural Intricacies

The lack of public datasets for some dialects emerged as a significant challenge and made it difficult for the models to capture the nuances of these dialects. 

“Our findings underscore that prevailing LLMs remain far from inclusive, with only limited ability to cater for the linguistic and cultural intricacies of diverse communities,” the researchers pointed out.

Despite the lack of data, the researchers found that LLMs, on average, outperformed existing commercial MT systems in translating dialects, affirming that “LLMs […] are better translators of dialects than existing commercial systems.”

Specifically, GPT-4 demonstrated consistent superiority over GPT-3.5-turbo, except in few-shot scenarios, where GPT-3.5-turbo achieved comparable performance. Moreover, in the majority of the evaluated varieties, both GPT-3.5-turbo and GPT-4 outperformed Bard, underscoring their relative effectiveness on these language varieties.

The researchers also aimed to identify the most effective prompts for instructing the LLMs. To that end, three prompt candidates were tested: a concise English prompt, an elaborate English prompt, and an Arabic prompt. The results indicated that the concise English prompt outperformed the others, aligning with previous research favoring English prompts for LLMs.
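The three prompt candidates can be pictured as simple templates filled with the source text. The wording below is hypothetical (the paper's exact prompts are not reproduced here); it only illustrates the concise-English / elaborate-English / Arabic contrast the researchers compared.

```python
# Hypothetical prompt templates for dialect-to-English MT.
# These are illustrative stand-ins, not the study's actual prompts.
PROMPTS = {
    "concise_english": "Translate the following text into English: {text}",
    "elaborate_english": (
        "You are an expert translator of Arabic varieties. Carefully "
        "translate the following Arabic text into fluent English, "
        "preserving its meaning and tone: {text}"
    ),
    # The same instruction, written in Arabic rather than English.
    "arabic": "ترجم النص التالي إلى الإنجليزية: {text}",
}

def build_prompt(variant: str, text: str) -> str:
    """Fill the chosen template with the source-language text."""
    return PROMPTS[variant].format(text=text)
```

In the study's results, the concise English variant was the strongest performer, so a pipeline built this way would default to that template.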

Furthermore, the study extended its evaluation to a human-centric analysis of how well Bard follows human instructions during translation tasks. The findings revealed that Bard has only a limited ability to align with human instructions in translation contexts.

Authors: Karima Kadaoui, Samar M. Magdy, Abdul Waheed, Md Tawkat Islam Khondaker, Ahmed Oumar El-Shangiti, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed