Big tech companies now treat the number of languages handled by their large language models (LLMs) as a point of pride and an indicator of technological prowess.
Multilingual LLMs train on text from many languages at once, inferring connections between them that allow the models to bridge high- and low-resource languages in tasks such as machine translation (MT).
But in many cases, multilingual models held up as emblems of commitment to linguistic diversity can actually exacerbate the “resourcedness gap” that exists between high-resource and low-resource languages, according to a May 2023 report by the nonprofit Center for Democracy and Technology (CDT).
The root of these issues is the problem the models are meant to address: the disparity of data between languages. English, the lingua franca of the Internet, generally has the most data available, while data for languages with far fewer speakers may be limited to translations of religious texts or Wikipedia articles (which may have been MTed to begin with).
A reliance on translated data for training and fine-tuning can backfire. The model may struggle to build accurate representations of words with different connotations across languages. Translationese, MT’s “uncanny valley,” can come close to human translation — until it presents language no native speaker would naturally use: oversimplified or overcomplicated sentences; repeated words; borrowing too much or too little from the source language.
It is difficult to filter out problems caused by MT, not only because they manifest inconsistently between languages and systems, but also because researchers sometimes do not even realize their model has been trained on MT. This is especially likely to be the case for low-resource languages whose limited data online was MTed from the start.
“Even benchmarks to test how well multilingual language models work in high and low resource languages are often translated from another language, leaving researchers with less of a sense of how well these models work on language as spoken by native speakers,” the report stated.
By Definition, Unequal
The fact is, the authors wrote, multilingual language models do not — and cannot — work equally well in all languages. Blame the so-called curse of multilinguality: “The more languages a multilingual model is trained on, the less it can capture the unique traits of any specific language.”
“In general, multilingual language models struggle with languages written in non-Latin scripts, language isolates, and families of languages less connected to those of high resource languages,” they wrote. “This threatens to create a poor-get-poorer dynamic for languages that are only similar to other low resource languages, as is the case with many widely spoken African languages including Swahili, Amharic, and Kabyle.”
Linguistic diversity makes for attractive PR, but with future research dependent on profit, LLM developers often prioritize the languages of generally wealthier speakers.
The impact can be felt in a model’s vocabulary. Developers may use shortcuts to keep the cost of computational resources down, decreasing the model’s vocabulary size — the total number of words a language model can choose from to predict the next word in a sentence — and the model’s ability to capture semantic relationships between words.
Multilingual models trained on mostly English data may also have vocabularies skewed toward English: common words might be missing for other high-resource languages, common sub-words might be left out for medium-resource languages, and certain letters might not be available at all for low-resource languages.
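The effect of a skewed vocabulary is easy to see in miniature. The sketch below (a hypothetical, toy vocabulary, not any production tokenizer) uses greedy longest-prefix matching, a simplified stand-in for subword tokenization: an English word that appears in the vocabulary stays whole, while a word from an underrepresented language falls apart into single characters, consuming many more tokens.

```python
# Minimal sketch with a hypothetical English-skewed vocabulary.
# Greedy longest-prefix-match subword tokenization: a stand-in for
# how real subword tokenizers fragment text outside their vocabulary.

def tokenize(word, vocab):
    """Split `word` into subwords, always taking the longest vocabulary
    match at the current position; unknown spans fall back to characters."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            # Accept a vocabulary match, or a single character as fallback.
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Toy vocabulary skewed toward whole English words.
vocab = {"language", "model", "the", "ing", "tion"}

print(tokenize("language", vocab))  # stays whole: ['language']
print(tokenize("lugha", vocab))     # Swahili for "language": shatters
                                    # into ['l', 'u', 'g', 'h', 'a']
```

In a real model, that five-fold difference in token count means the low-resource language pays more per sentence in context length and compute, and its words never get dedicated embeddings that could capture their semantics.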
No Silver Bullet
With so many ways for multilingual models to go wrong, who has done it right? According to CDT, “Groups like BigScience pave the way” by publishing documentation about their content analysis systems, ideal for letting others know about the makeup of a model’s training data (e.g., which languages, how much data per language, source of datasets, etc.).
The current standard in big tech is for companies to talk up their advancements in blog posts and press releases. Larger companies may open-source research versions of their models, but these differ from the ones used in production, and as a business strategy, tech companies are rarely willing to share even the most basic information about the systems they actually deploy.
CDT believes companies should invest in improving language model performance in individual languages, developing better benchmarks, and involving language experts and other stakeholders in the process.
Hiring locals to create data sets and develop benchmarks can be expensive and resource-intensive for models designed for many languages, CDT acknowledged, but “these actors are crucial to ensuring that labeled training datasets adequately capture the nuances and variations of a given language.”
Similarly, the authors wrote, investors should focus their efforts on creating self-sustaining, scholarly non-English NLP communities, such as Masakhane, AmericasNLP, and ARBML. Private companies can contribute financial support as well as share the non-English datasets they use to train LLMs. Government investments — such as the French government’s support of BLOOM — can make up for underinvestment by the private sphere.
Now, while the norms surrounding AI are very much in flux, the authors wrote, international standards bodies and regulatory agencies have an opportunity to disrupt the linguistic factors contributing to the digital divide.