Large language models (LLMs), such as ChatGPT, have shown remarkable capabilities in performing a range of language tasks, including machine translation (MT). But how effective are they when it comes to low-resource languages (LRLs)?
A research paper published on September 14, 2023, delves into the translation prowess of ChatGPT and other LLMs across a diverse set of 204 languages, encompassing both high- and low-resource languages. According to the authors, this is “the first experimental evidence for an expansive set of 204 languages.”
Nathaniel R. Robinson, Perez Ogayo, David R. Mortensen, and Graham Neubig from Carnegie Mellon University underscored the need for such an investigation, noting that there exists a wide variety of languages for which recent LLM MT performance has never been evaluated. As a result, it is difficult for speakers of the world’s diverse languages to know how and whether they can use LLMs for their linguistic needs.
In addition, the authors emphasized that “the majority of LRLs are largely neglected in language technologies” in general, with current MT systems either performing poorly on them or not including them at all. “Some commercial systems like Google Translate support a number of LRLs, but many systems do not support any,” they said.
The authors pointed out that their work differs from existing studies in its focus on end users. The inclusion of 204 languages, 168 of them LRLs, underscores a commitment to addressing the diverse needs of LRL communities, which are frequently overlooked in the discourse on language technology. “We include more languages than any existing work […] to address the needs of various LRL communities,” they explained.
They evaluated ChatGPT’s MT performance across the entire language set against a baseline of NLLB-MOE, the current state-of-the-art open-source MT model with wide language coverage. Comparative evaluations were also carried out on subsets of selected languages using Google Translate and GPT-4.
In their exploration of MT prompts, they employed both zero- and five-shot approaches for ChatGPT MT. The evaluation metrics, spBLEU and chrF2++, provided a robust basis for assessing the outputs.
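The difference between the zero- and five-shot setups comes down to whether translation examples are prepended to the prompt. A minimal sketch of how such prompts might be constructed (the wording and format here are illustrative assumptions, not the paper’s exact templates):

```python
def build_prompt(src_lang, tgt_lang, text, examples=()):
    """Build a translation prompt, optionally prepending few-shot examples.

    `examples` is a sequence of (source, translation) demonstration pairs;
    an empty sequence yields a zero-shot prompt.
    """
    lines = []
    for src, tgt in examples:  # few-shot demonstrations, if any
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {text}")
    lines.append(f"{tgt_lang}:")  # the model completes the translation here
    return "\n".join(lines)

# Zero-shot: only the sentence to translate.
zero_shot = build_prompt("English", "Swahili", "Good morning.")

# Few-shot: demonstration pairs come first (one shown here; the paper used five).
five_shot = build_prompt(
    "English", "Swahili", "Good morning.",
    examples=[("Thank you.", "Asante.")],
)
```

The few-shot variant is strictly longer, which matters later when costs are discussed, since every demonstration pair adds billable input tokens.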
The results suggest that while ChatGPT models approach or even surpass the performance of traditional MT models for some high-resource languages, they consistently lag for LRLs. Notably, African languages emerge as a particular challenge, with ChatGPT underperforming traditional MT in a substantial 84.1% of the languages studied.
Language Resources and Costs
The researchers also examined language features, including language resources, language family, and script, to assess the effectiveness of LLMs.
This analysis aimed to uncover trends that could guide end users in selecting the most appropriate MT system for their specific language. “Analyzing this may reveal trends helpful to end users deciding which MT system to use, especially if their language is not represented here but shares some of the features we consider,” they said.
According to the authors, a language’s resource level is the most important feature in predicting ChatGPT’s MT effectiveness, while script is the least important.
The authors stressed financial aspects as well, particularly as they pertain to LLM users. “We evaluate monetary costs, since they are a concern for LLM users,” the authors said. Few-shot prompts, despite their potential for modest improvements in translation quality, come at a higher cost: providers charge for input as well as output tokens, and the demonstration examples lengthen the input.
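The cost asymmetry is simple arithmetic. A sketch with hypothetical per-token prices (real API pricing varies by model and changes over time) shows why a five-shot prompt costs more even when the translation itself is the same length:

```python
# Assumed prices for illustration only; not actual provider rates.
PRICE_PER_INPUT_TOKEN = 0.0000015   # USD, hypothetical
PRICE_PER_OUTPUT_TOKEN = 0.000002   # USD, hypothetical

def request_cost(input_tokens, output_tokens):
    """Total charge for one request: input and output tokens are both billed."""
    return (input_tokens * PRICE_PER_INPUT_TOKEN
            + output_tokens * PRICE_PER_OUTPUT_TOKEN)

# The five-shot prompt carries demonstration pairs as extra input tokens,
# so its input is several times longer for the same output.
zero_shot_cost = request_cost(input_tokens=40, output_tokens=30)
five_shot_cost = request_cost(input_tokens=240, output_tokens=30)
assert five_shot_cost > zero_shot_cost
```

Whether the modest quality gain from few-shot prompting justifies the multiplied input cost is exactly the trade-off the authors ask end users to weigh.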
The authors emphasized that they want to help end users of various language communities know how and when to use LLM MT. “We expect that our contributions may benefit both direct end users, such as LRL speakers in need of translation, and indirect users, such as researchers of LRL translation considering ChatGPT to enhance specialized MT systems,” they concluded.