Machine translation (MT) tends to be stronger, and to improve more quickly, for languages with abundant reference data. One region where such data has historically been lacking is Africa, whose 2,000-plus languages are underrepresented in the world of natural language processing (NLP), according to Masakhane project co-founders and chief investigators Laura Martinus and Jade Abbott.
The two South Africans have described a self-defeating cycle in which speakers believe that their languages will not be accepted as prime modes of communication. This, in turn, leads to a lack of funding for translation projects and a dearth of language resources; those that do exist are often siloed in country-specific institutions.
Inspired by the Deep Learning Indaba theme for 2018, Martinus and Abbott started the Masakhane project (whose name means “we build together” in isiZulu) to connect NLP professionals in different countries, with the ultimate goal of translating the Internet “and its content into our languages, and vice versa.”
Now, more than 60 participants in 15 countries are involved in a continent-wide effort to build MT models for African languages. (The Masakhane project also collaborates with RAIL Lab at the University of the Witwatersrand and Translators Without Borders.)
The plan: Gather language data and develop MT models, which will then be analyzed and fine-tuned.
Martinus and Abbott have already trained models to translate English into five of South Africa’s 11 official languages (Afrikaans, isiZulu, Northern Sotho, Setswana, Xitsonga) using Convolutional Sequence-to-Sequence (ConvS2S) and Transformer architectures. They presented their findings at the 2019 Annual Meeting of the Association for Computational Linguistics (ACL).
Since being profiled by VentureBeat in November 2019, the group has continued its work with a range of languages and has made a point of releasing any gains publicly to combat the “low discoverability” of relevant resources, a major challenge for many African languages.
Chief Investigator Kathleen Siminyu told Slator that the project now has 16 languages with benchmarks, which can be seen on the Masakhane project’s GitHub page.
“We are currently getting a lot of submissions, so this number is increasing often,” Martinus told Slator. “There are a few people I know who want to submit benchmarks soon, but have yet to finish up.”
On a less field-specific platform, Abbott tweeted on January 22, 2020, that contributor Julia Kreutzer, a PhD student in Germany, had “used JoeyNMT to train an English-to-Afrikaans model and deploy it as a slack bot on our @MasakhaneMt slack account (Afrikaans chosen because as a German speaker, she could sorta figure out that it was sorta working).”
The Masakhane project plans to present at the AfricaNLP workshop set for April 2020 in Ethiopia. “At the moment, it looks like we will submit six papers, maybe more,” Siminyu said.
Martinus added that many Masakhane participants are also currently writing papers for the first workshop on Resources for African Indigenous Languages (RAIL) in May 2020, to be hosted by the South African Centre for Digital Language Resources (SADiLaR).