August 20, 2020
New Research: Do Translation Equivalents Really Mean the Same Thing?
To paraphrase Shakespeare, does a rose smell just as sweet if it is called rosa, ruža, or τριαντάφυλλο?
In a paper published on August 10, 2020 in the journal Nature Human Behaviour, the authors (Princeton computer scientist Bill Thompson, Seán G. Roberts of Cardiff University, and Gary Lupyan of the University of Wisconsin-Madison) used an algorithm to determine whether translation equivalents really mean the same thing in each language.
The answer, it turns out, depends on how similar speakers’ cultures are — in other words, it is easier to translate between two languages spoken by groups with similar cultures than between languages whose speakers have very different cultures. “Our results do not fully fit into either the universalist or relative perspectives,” the researchers wrote, referencing two opposing schools of thought in linguistics.
From a universalist viewpoint, concepts integral to the human condition exist independent of language, and vocabularies are used to name those concepts. By contrast, a relative perspective states that language vocabularies are influenced by culture, and speakers come to understand concepts, categories, and types while learning the language.
Although research, such as the 2007 study on “the Russian blues,” has suggested that language may affect perception, a final verdict in the universalist-relative debate has been hard to come by without a consistent way of quantifying similarities between languages.
Past studies have also typically been limited to the comparison of two languages at a time. The authors view their research “as an early attempt to quantify semantic alignment at scale using distributional semantics.”
To compute semantic alignment (that is, the relationships between words with similar meanings), researchers looked for the range of contexts in which a given word was used and the frequency with which it was used.
Their main analyses applied the fastText skipgram algorithm to language-specific versions of Wikipedia, and the analyses were replicated using embeddings derived from the OpenSubtitles2018 database and from a combination of Wikipedia and the Common Crawl dataset.
For each word (such as “beautiful” in English), the algorithm identified semantic neighbors, words that often appear nearby (e.g., “colorful,” “love,” and “sparkle”). It then translated those semantic neighbors into a target language (for example, French) and calculated their semantic similarity to the French equivalent, “beau.”
Next, the algorithm identified semantic neighbors of “beau” in French and translated them into English. The final similarity score for a word’s meaning quantifies how closely the semantics aligned in both directions of the translation.
This process was repeated for word forms for 1,010 concepts in 41 languages across 10 language families. Drawn from the NorthEuraLex (NEL) dataset, which is compiled from dictionaries and other linguistic resources that are available for individual languages in Northern Eurasia, those words spanned 21 semantic domains, including both concrete and abstract concepts.
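The two-way neighbor comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional vectors, the toy English-to-French dictionary, and the function names are all hypothetical, and the choice to score each direction with a Pearson correlation and then average the two directions is an assumption made for illustration.

```python
import math

# Toy 2-D "embeddings" (hypothetical values, for illustration only).
EN = {
    "beautiful": (1.0, 0.2),
    "colorful": (0.9, 0.4),
    "love": (0.7, 0.7),
    "sparkle": (0.8, 0.1),
    "tax": (0.0, 1.0),
}
# Toy translation dictionary (hypothetical, one word per concept).
EN_TO_FR = {
    "beautiful": "beau",
    "colorful": "multicolore",
    "love": "amour",
    "sparkle": "scintillement",
    "tax": "taxe",
}
# A French space with exactly the same geometry as the English one.
FR_SAME = {EN_TO_FR[w]: v for w, v in EN.items()}
# A French space in which "beau" sits closer to "amour" than to "scintillement".
FR_DIFF = dict(FR_SAME, beau=(0.6, 0.8))


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def neighbors(word, space, k):
    """The k words in `space` most similar to `word`, most similar first."""
    others = [w for w in space if w != word]
    others.sort(key=lambda w: cosine(space[word], space[w]), reverse=True)
    return others[:k]


def directed_alignment(word, src, tgt, translate, k):
    """Compare a word's neighborhood in `src` with its translation's in `tgt`."""
    nbrs = neighbors(word, src, k)
    src_sims = [cosine(src[word], src[n]) for n in nbrs]
    tgt_sims = [cosine(tgt[translate[word]], tgt[translate[n]]) for n in nbrs]
    return pearson(src_sims, tgt_sims)


def alignment(word, src, tgt, to_tgt, k=3):
    """Average the directed score over both translation directions."""
    to_src = {t: s for s, t in to_tgt.items()}
    forward = directed_alignment(word, src, tgt, to_tgt, k)
    backward = directed_alignment(to_tgt[word], tgt, src, to_src, k)
    return (forward + backward) / 2
```

With these toy inputs, `alignment("beautiful", EN, FR_SAME, EN_TO_FR)` is approximately 1.0 because the two spaces share the same geometry, while the same call with `FR_DIFF` gives a lower score because "beau" has a different neighborhood than "beautiful".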
Human judgments were used to validate the computed semantic alignment. The researchers found a strong correlation between the algorithm's scores and similarity judgments made by native speakers for Dutch–English translation pairs, as well as with a set of Japanese–English translatability ratings for 192 word pairs.
Most notably, the team used the semantic alignment measure to predict how consistently speakers of six languages would use the same term to name 750 images. Meanings with lower semantic alignment between languages were associated with less consistent name agreement across the six languages.
This exercise also confirmed the researchers’ prediction that larger differences in name agreement corresponded to lower overall alignment. For example, when shown a picture of a clothes hanger, 100% of Spanish speakers called it “percha”; 77% of English speakers called it a hanger; and 33% of Italian speakers called it “appendino.” Accordingly, Spanish and Italian might have a lower alignment than would Spanish and English.
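Name agreement of the kind reported in the hanger example can be computed as the share of speakers who give the most common name for an image. The sketch below assumes that definition; the function name and the response lists are hypothetical, and the alternative Italian responses are invented placeholders rather than data from the study.

```python
from collections import Counter


def name_agreement(responses):
    """Return the most common name and the share of speakers who used it."""
    if not responses:
        raise ValueError("at least one response is required")
    # Normalize case and whitespace so "Percha " and "percha" count together.
    counts = Counter(r.strip().lower() for r in responses)
    modal_name, modal_count = counts.most_common(1)[0]
    return modal_name, modal_count / len(responses)


# Hypothetical responses for one image (not data from the study).
spanish = ["percha"] * 10
italian = ["appendino"] * 3 + ["gruccia"] * 2 + ["stampella"] * 2 + ["attaccapanni"] * 2

spanish_name, spanish_share = name_agreement(spanish)
italian_name, italian_share = name_agreement(italian)
```

Here every hypothetical Spanish speaker gives the same name, so the share is 1.0, while only three of the nine hypothetical Italian speakers give the most common name, so the share is one third.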
To be fair, the study found that there are some “universally translatable” words, though not words associated with natural or very concrete meanings as expected. Instead, domains with fewer dimensions by which to organize terms were most alignable; namely, number words, temporal terms, and common kinship terms.
“Although kinship systems vary, terms denoting close kin relations are organized along a few dimensions, such as gender (son/daughter, mother/father) and generation (grandmother/mother/daughter). This low dimensionality seems to enable high alignment,” the authors wrote.
This explanation alludes to perhaps the most compelling part of the study, in which the researchers applied another algorithm that quantified the overall similarity of two cultures that produced different languages.
The algorithm analyzed the proportion of cultural traits in common, based on an anthropological dataset of 92 non-linguistic cultural traits of 39 societies, and compared features such as marriage practices, legal systems, and political organization of speakers.
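A proportion of shared traits can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the trait names, and the trait values are hypothetical, and skipping traits that are missing for either society is a design choice made here, not a detail taken from the study.

```python
def cultural_similarity(traits_a, traits_b):
    """Share of comparable traits on which two societies have the same value."""
    # Only compare traits recorded (non-missing) for both societies.
    comparable = [
        t for t in traits_a
        if t in traits_b and traits_a[t] is not None and traits_b[t] is not None
    ]
    if not comparable:
        raise ValueError("no traits are recorded for both societies")
    matches = sum(traits_a[t] == traits_b[t] for t in comparable)
    return matches / len(comparable)


# Hypothetical trait codings for two societies (not from the dataset).
society_a = {
    "marriage_practice": "monogamy",
    "legal_system": "codified",
    "political_organization": "state",
    "subsistence_type": None,  # missing value, skipped in the comparison
}
society_b = {
    "marriage_practice": "monogamy",
    "legal_system": "customary",
    "political_organization": "state",
    "subsistence_type": "agriculture",
}

similarity = cultural_similarity(society_a, society_b)
```

In this toy case three traits are comparable and two of them match, so the similarity is two thirds.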
Within semantic domains, the cultural correlation was strongest for words related to food and drink, time, animals, and the body. Cultural similarity also corresponded to higher alignment for specific cultural domains. For example, two cultures with a similar “subsistence type” show higher semantic alignment in related cultural domains, such as “food and drink,” “animals,” and “agriculture and vegetation.”
For 19 Indo-European languages for which detailed information on historical and geographical proximity was available, researchers also found that historical proximity correlated more closely with semantic alignment than did geographical proximity. Ultimately, they concluded that “semantic alignment between languages is better predicted by cultural similarity than by the geographical proximity of the populations who speak them.”