As the world was about to plunge into the darkest days of the Covid-19 pandemic at the end of March, the UN Secretary General tweeted about battling that other enemy, “the ‘infodemic’ of misinformation,” highlighting the urgent need to “promote facts & science, hope & solidarity over despair & division.”
It was in light of this need that initiatives such as TICO-19 were formed. The name stands for Translation Initiative for Covid-19, and the initiative brings together collaborators from Translators without Borders (TWB), academia (Carnegie Mellon University, Johns Hopkins University), language services (Appen, Translated), and Big Tech (Amazon, Facebook, Google, Microsoft).
The eponymous group behind the July 3, 2020 paper on preprint server arXiv.org pointed out that communicating to vulnerable populations about how they can protect themselves was crucial to stemming the tide of Covid-19, which WIRED.com called “history’s biggest translation challenge” in a May 2020 article on TICO-19.
TICO-19 has made test and development data in 35 languages (9 high-resource pivot languages plus 26 relatively low-resource languages) available to machine translation (MT) researchers to enable the translation of Covid-related content into those languages.
The research provides three things: (1) a collection of translation memories and technical glossaries for language service providers (LSPs), translators, and volunteers to help them work consistently and accurately; (2) an open-source, multilingual benchmark set with data for very-low-resource languages specific to the medical domain, which aims to track the quality of current MT systems and enable future research; (3) monolingual and bilingual resources for MT practitioners “to advance the state-of-the-art in medical and humanitarian [machine translation] (MT), as well as other natural language processing (NLP) applications.”
The main consideration in choosing the 35 languages, the researchers said, was the “potential impact of our collected translations and the humanitarian priorities of TWB.” The languages were divided into the following groups:
• Pivots – 9 major languages (i.e., lingua francas in large parts of the world): Arabic, Simplified Chinese, French, Brazilian Portuguese, Latin American Spanish, Hindi, Russian, Swahili, and Indonesian.
• Priority – 18 languages classified by TWB as high-priority due to large demand from partners, such as the Red Cross; these include languages in Asia (Dari, Central Khmer, Kurdish Kurmanji in the Latin script, Kurdish Sorani in the Arabic script, Nepali, Pashto) and Africa (Amharic, Dinka, Nigerian Fulfulde, Hausa, Kanuri, Kinyarwanda, Lingala, Luganda, Oromo, Somali, Ethiopian Tigrinya, Zulu).
• Important – 8 languages spoken by millions in South and Southeast Asia: Bengali, Burmese (Myanmar), Farsi, Malay, Marathi, Tagalog, Tamil, and Urdu.
The “Priority” and “Important” groups cover languages used in communities that, based on feedback from the field, “may be most susceptible to the spread of the virus and its potentially disastrous ramifications, mostly due to lack of access to information.”
So “overwhelmingly under-resourced” are some languages that they “have remained untouched by the AI and MT communities,” and no known tools or resources have, as yet, been developed for them, the researchers said. They added that additional languages, such as Congolese Swahili, Nuer, and Eritrean Tigrinya, will soon be added to the collection.
The TICO-19 team concluded that their effort only addresses “a fraction of the needs for a fraction of the world’s languages.” However, they hope their research will have an immediate impact for the languages covered and, especially as it pertains to the translation benchmark, “allow the MT research community, both academic and industrial, to be more prepared for the next crisis where translation technologies will be needed.”