Found in Swahili: 37 Teams Compete to Help US Intelligence Agency Query the World

Ask a question in English, find the answer in Swahili, and win USD 10,000 to 20,000. This is the gist of the OpenCLIR (Open Cross Lingual Information Retrieval) Prize Challenge. OpenCLIR is organized by the Intelligence Advanced Research Projects Activity (IARPA), a US government research program under the Director of National Intelligence.

The OpenCLIR challenge asks participants to build a system that, when queried in English, can extract relevant information from speech and text documents in low-resource languages. Low-resource languages are those for which very little training data is available.

According to IARPA Project Manager Carl Rubino, OpenCLIR was funded by IARPA “as an open evaluation of the MATERIAL program.” Rubino added that the US National Institute of Standards and Technology (NIST) runs the evaluation.

MATERIAL, or Machine Translation for English Retrieval of Information in Any Language, is a broader IARPA program that covers functions such as summarization and domain classification in addition to multilingual information retrieval. OpenCLIR is “a simplified, smaller scale evaluation open to all,” which zeroes in on “computationally underserved languages.”

Large Data Volumes in Sundry Less Studied Languages

According to the OpenCLIR evaluation plan, it is “one of several [capabilities] expected to ultimately support effective triage and analysis of large volumes of data, in a variety of less studied languages.” Successful systems should, therefore, be able to adapt to entirely new languages.

OpenCLIR was launched in July 2018 and began seeking out researchers the world over. Registration closed on November 30, 2018, and OpenCLIR passed its latest milestone on January 21, 2019: the release of data that participants will use to test their systems. Participants must submit their algorithms by February 1, 2019, and winners will be announced on May 15, 2019.

The best systems in the speech and text files categories will win USD 20,000 and USD 10,000, respectively.

“We have 37 registrations from around the world, including five from India, four from China, two from Germany, as well as teams from the Czech Republic, the Netherlands, Japan, Poland, Romania, Russia, Taiwan, Italy, Turkey, Singapore, and Spain,” Rubino told Slator, adding that submitted systems will work with Swahili.

Rubino said they chose the low-resource language “based on a combination of its linguistic properties, availability of in-language electronic text and speech ‘in the wild’, and value to the academic and government communities.”

Millions into Multilingual Research

Aside from MATERIAL, the US government runs other programs concerning natural language processing (NLP) and neural machine translation (NMT). These include the USD 14m SCRIPTS (or System for Cross Language Information Processing, Translation, and Summarization) and the more than USD 26m LORELEI (or Low Resource Languages for Emergent Incidents).

Programs such as LORELEI have specific deployment scenarios in mind, such as conflict or incident zones. Asked about potential real-world deployments of OpenCLIR or the resulting technologies, Rubino replied that the possibilities are open-ended.

“These technologies can be potentially deployed in any scenario where an organization needs to quickly develop capabilities to perform domain-specific triage of a large collection of speech and text documents in a language for which this organization does not have a large cadre of speakers of that language,” he said.