Big Tech’s focus on the translation of low-resource languages was recently highlighted when Facebook, on October 18, 2020, unveiled a model that would avoid using English as a pivot language between source and target languages. As reported by Slator, it was the “culmination of years of […] work in machine translation.”
Before that, there was Google’s research on massively multilingual neural machine translation (NMT), published back in July 2019, and more recent work on what the search giant calls “Complete Multilingual Neural Machine Translation.” The resulting NMT model, which improved translation for languages with sparse training data, was five years in the making.
Now from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) comes a model that can “automatically decipher a lost language without needing advanced knowledge of its relation to other languages.”
The CSAIL research team was led by MIT Professor Regina Barzilay, who has spent a couple of decades on language-related research, among other data science-driven topics. “The team’s ultimate goal is for the system to be able to decipher lost languages that have eluded linguists for decades, using just a few thousand words,” wrote Adam Conner-Simons, CSAIL Communications Manager, on the lab’s blog.
The study was supported in part by IARPA, the research funding arm of the US intelligence community. Over the years, IARPA has invested in low-resource language technology that can be queried in English by holding conferences, offering grants, and running contests.
Ultimate Low-Resource Challenge for Humans and Machines
The new CSAIL study builds on a 2019 paper, where the authors, including Barzilay, propose a new approach to the automatic decipherment of lost languages. “Decipherment is an ultimate low-resource challenge for both humans and machines. The lack of parallel data and scarce quantities of ancient text complicate the adoption of neural methods that dominate modern machine translation,” the researchers wrote.
However, while the 2019 paper worked with languages known to be related to early forms of Hebrew and Greek, the new CSAIL study, which evaluated the model on Gothic, Ugaritic, and Iberian, infers the relationship between languages algorithmically, making the approach applicable to more undeciphered scripts than prior work.
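To give a rough sense of what “inferring the relationship by algorithm” means, the sketch below ranks hypothetical candidate languages by how closely their word forms resemble an undeciphered vocabulary. It is only a toy Python illustration using plain string similarity; the CSAIL model itself learns character-level representations and alignments, and every word and language name in the example is invented.

```python
# Toy illustration (not the CSAIL model): rank candidate known languages by
# how well their lexicons match an undeciphered vocabulary, using simple
# string similarity on made-up transliterations.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means the character sequences are identical.
    return SequenceMatcher(None, a, b).ratio()

def rank_candidate_languages(unknown_words, candidate_lexicons):
    # Score each candidate language by how closely its lexicon matches the
    # unknown word forms (higher average similarity = better presumed fit).
    scores = {}
    for lang, lexicon in candidate_lexicons.items():
        best_matches = [max(similarity(u, k) for k in lexicon) for u in unknown_words]
        scores[lang] = sum(best_matches) / len(best_matches)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Entirely invented transliterations and language names, for illustration only.
unknown = ["malku", "bitu", "shamu"]
candidates = {
    "hypothetical_relative": ["malk", "bet", "shamayim"],
    "hypothetical_unrelated": ["rex", "domus", "caelum"],
}
print(rank_candidate_languages(unknown, candidates))
```

The point of the toy is only the ranking step: each candidate language is scored against the unknown vocabulary, and the best-scoring one is treated as the inferred relative, rather than that relationship being supplied by a human in advance.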
Using the algorithm, the team was able to confirm recent scholarship that suggests Iberian is not actually related to Basque as previously believed.
According to Conner-Simons, “The team hopes to expand their work beyond the act of connecting texts to related words in a known language — an approach referred to as ‘cognate-based decipherment.’ This paradigm assumes that such a known language exists, but the example of Iberian shows that this is not always the case. The team’s new approach would involve identifying semantic meaning of the words, even if they don’t know how to read them.”
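As a rough intuition for what identifying the meaning of words one cannot read could look like, the sketch below groups unreadable symbols by the contexts in which they occur, the basic idea behind distributional semantics. This is not the team’s method, just a minimal Python illustration with invented symbol sequences.

```python
# Toy illustration (not the team's approach): estimate which unreadable symbols
# behave alike by comparing the contexts they appear in. All "texts" are invented.
from collections import Counter
from math import sqrt

def context_vectors(tokenized_texts, window=1):
    # Count which symbols co-occur near each symbol within a small window.
    vectors = {}
    for tokens in tokenized_texts:
        for i, tok in enumerate(tokens):
            ctx = vectors.setdefault(tok, Counter())
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors

def cosine(c1, c2):
    # Cosine similarity between two sparse context-count vectors.
    keys = set(c1) | set(c2)
    dot = sum(c1[k] * c2[k] for k in keys)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

# Invented corpus of "unreadable" symbol sequences.
corpus = [["A", "B", "C"], ["A", "B", "D"], ["E", "B", "C"]]
vecs = context_vectors(corpus)
print(cosine(vecs["A"], vecs["E"]))  # A and E occur in similar contexts
```

Symbols that keep turning up in similar surroundings get similar context vectors, which is one way meaning could be approximated even when the script itself remains unread.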
Image: Phaistos Disk, an undeciphered clay disk from the Minoan palace of Phaistos, Crete, dating to the Minoan Bronze Age; exhibited in the Heraklion Archaeological Museum, Crete, Greece.