June 10, 2021
Open-Source Tool Helps Users Anonymize Data in 24 EU Languages
The first results of a data anonymization project were released in May 2021 as a pre-beta online demo. Called “MAPA” (Multilingual Anonymization toolkit for Public Administrators), the project aims to help EU public administrators share data while staying compliant with data regulations.
MAPA is led by language service provider (LSP) Pangeanic, which was awarded EUR 1m in funding by the European Commission’s Innovation and Networks Executive Agency (INEA) in January 2020. Pangeanic is working alongside a number of partners, including the French National Centre for Scientific Research (CNRS), LSP Tilde, language resource center ELRA, and the University of Malta.
Using AI-based Named Entity Recognition (NER), the tool identifies personal details that fall under the EU’s General Data Protection Regulation (GDPR). Data such as names, credit card numbers, dates, and professions are then anonymized. Entering the English sentence “Rosalind Franklin was born on 25 July 1920,” for instance, will return “******* ******** was born on ** **** ****.”
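To make the masking step concrete, here is a minimal Python sketch of how detected entity spans could be replaced with asterisks. It is not MAPA's code: the `mask_spans` and `find_span` helpers and the `PERSON` and `DATE` labels are hypothetical, and the spans are located by plain string search rather than predicted by an NER model. This version masks one asterisk per non-space character.

```python
from typing import List, Tuple

# Hypothetical entity span: (start offset, end offset, label).
Span = Tuple[int, int, str]


def find_span(text: str, phrase: str, label: str) -> Span:
    """Locate a phrase by string search; stands in for an NER model's prediction."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def mask_spans(text: str, spans: List[Span], mask_char: str = "*") -> str:
    """Replace every non-whitespace character inside each span with mask_char."""
    chars = list(text)
    for start, end, _label in spans:
        for i in range(start, end):
            if not chars[i].isspace():
                chars[i] = mask_char
    return "".join(chars)


sentence = "Rosalind Franklin was born on 25 July 1920."
# Spans chosen by hand for illustration only.
spans = [
    find_span(sentence, "Rosalind Franklin", "PERSON"),
    find_span(sentence, "25 July 1920", "DATE"),
]
masked = mask_spans(sentence, spans)
print(masked)
```

Keeping spaces and the surrounding words intact leaves the sentence structure readable while the personal details themselves are hidden.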
The tool is available in all 24 official EU languages and focuses on the legal and medical domains. The beta version will be released in June 2021, while the final toolkit will be available later in the year for several use cases.
The tool will be downloadable as a fully deployable Docker container under an open-source license. (Docker is a platform that packages software together with its dependencies into a container, so the software runs the same way on different systems.) Once released, users will be able to incorporate the tool into their own processes by building on the existing code.
Dual Needs: Transparency and Compliance
The MAPA project arose to address a dilemma: how can public administrators share data across public bodies and borders within the EU while protecting EU citizens’ data?
Manuel Herranz, CEO of Pangeanic, told Slator, “EU administrators suffer from a double mandate. They must be seen to offer transparency in the way data is shared across the EU while also complying with GDPR.”
A tool such as MAPA, which reliably removes personal details in all EU languages, will pave the way for EU administrations to benefit from big data by, for example, sharing large datasets for machine learning purposes. The first major use case, according to Herranz, will be European Complaints Watch, which will be provided with a locally run data anonymization service in each EU country.
Languages Learn From Other Languages
While the pre-beta version was developed within a year of project launch, the MAPA initiative has not been without its challenges. Covid-19 disrupted a plan to focus equally on legal and medical texts. “EU health authorities are already stressed so we’ve ended up focusing more on the legal domain,” Herranz said.
On the plus side, the AI-based tool revealed an intriguing ability: languages can learn from other languages. Herranz explained, “That’s the beauty of neural networks. We found that by mixing everything together in one large multilingual stew, the tool could recognize entities in languages for which it had not been trained.”
The finding gave the MAPA team an advantage. Low-resource languages could be trained to a reasonable level of accuracy using general multilingual data, then topped up with targeted data to enhance quality. “Maltese already ran very well when we had no Maltese network, and then by adding Maltese data, we were able to really fine-tune the results,” Herranz added.
So far, reaction to the pre-beta release has been positive. Herranz said, “We’ve received very good comments saying it’s working very well in Latvian, Spanish, and French.”
However, the project is still a work in progress. The MAPA team is aiming for an accuracy target of above 95%. While some languages are performing at 98%, others are sitting at around the 89% mark. According to Herranz, “Results are pretty promising in most languages but there is still more work to do.”