Pangeanic Wins Contract to Lead Europe-wide Anonymization Project


Last January, INEA awarded Pangeanic’s consortium almost €1M to develop a multilingual anonymization toolkit based on AI-driven Named Entity Recognition for the fields of health, life science, and justice. The final toolkit will be available as a downloadable, fully deployable Docker container under an open-source license.

Several Public Administrations are involved in the consortium, and more use cases in EU Member States are envisaged as the need for anonymization among public sector organizations grows. Public Administrations across Europe have a mandate for transparency and open data that clashes with the need to keep personal data safe and not share it with third parties.

The MAPA Project (Multilingual Anonymization toolkit for Public Administrations) will use state-of-the-art Natural Language Processing tools to develop the open-source toolkit, with a focus on the medical and legal domains, and will deploy it at several Public Administrations in Europe.

“MAPA will provide a high degree of data de-identification so organizations can share or release data, while protecting privacy. Implementation cases will focus on de-identifying, obfuscating or pseudo-anonymizing personally identifiable information. It will also be language independent, so no matter what language an organization deals with or the names mentioned therein, the solution will erase personal data. GDPR has changed the way data is transmitted, and there is an increased interest in protecting the individual’s privacy.” – Manuel Herranz, CEO

Some of Pangeanic’s development team at PangeaMT, Innsomnia Accelerator, Valencia.

The toolkit developed by the MAPA partners (Pangeanic, Tilde, the French National Center for Scientific Research (LIMSI at CNRS), the language resource center ELRA, the University of Malta, R&D Center Vicomtech, and Spanish Language Plan Government Office SEDIA via the Barcelona Supercomputing Center) will address all official EU languages.

Why Anonymize Data?

GDPR obliges organizations to protect citizens’ data so that it is not released to third parties. The MAPA data anonymization toolkit will provide the means to share language data while protecting personal or sensitive data. Releasing large amounts of anonymized data can, for example, give the community more training data for machine learning. It can also help companies with centers across the world move data safely across jurisdictions.

On a more practical level, justice departments, health authorities, and healthcare companies will be able to provide access to data and manage a de-identification strategy. Custom deployment cases will prove the adaptability and customization capabilities of the solution. Most importantly, MAPA will help organizations satisfy GDPR requirements at scale. Although no software can guarantee 100% accuracy in anonymization, just as perfect machine translation does not exist (yet), MAPA will make document sharing much easier.

Technical Approach to Anonymization

At its core, the MAPA anonymization toolkit will use Named-Entity Recognition and Classification (NERC) techniques based on Deep Learning with neural networks. The challenge of working with under-resourced languages such as Latvian, Lithuanian, Estonian, Slovenian or Croatian will be tackled with a multilingual NERC approach, which will also benefit ultra-under-resourced languages such as Maltese and Irish.
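
As a rough illustration of how NERC output can drive de-identification, the Python sketch below replaces detected entity spans with their class labels. It is not MAPA's implementation: the Entity type, the redact function, the label names and the hard-coded spans are all hypothetical stand-ins for whatever a trained model would return, and the spans are assumed not to overlap.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """Hypothetical NERC output: a character span plus its entity class."""
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # e.g. "PERSON" or "LOCATION" (illustrative label names)


def redact(text: str, entities: list[Entity]) -> str:
    """Replace each (non-overlapping) entity span with a bracketed label."""
    out = text
    # Go right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        out = out[:ent.start] + f"[{ent.label}]" + out[ent.end:]
    return out


text = "Dr. Anna Kalnina saw the patient in Riga."
# In a real system these spans would come from the NERC model;
# here they are located by hand for the sake of the example.
name, place = "Anna Kalnina", "Riga"
entities = [
    Entity(text.index(name), text.index(name) + len(name), "PERSON"),
    Entity(text.index(place), text.index(place) + len(place), "LOCATION"),
]
print(redact(text, entities))
```

Replacing a span with its class label keeps the sentence readable; swapping in a random surrogate value instead would be one way to pseudonymize rather than erase.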

In addition, thanks to the transfer learning capabilities shown by new types of Deep-Learning models, new systems can be trained using relatively small datasets of manually labelled data. The knowledge acquired for a given domain or language can be transferred and re-used cross-language or cross-domain. MAPA will be trained to detect named entities that involve sensitive information.

MAPA will be feature-rich: the NERC approach will be complemented with other configurable mechanisms, such as pattern detection based on regular expressions (passport or ID numbers, telephone numbers, street addresses, blood groups, age, sex, marital status, email addresses, bank accounts, etc.).
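
A minimal sketch of configurable, regex-based pattern detection might look like the following. The patterns, the label names and the mask_patterns function are assumptions made for this example only; they are deliberately simple and would need tightening for production use (real ID, phone and address formats vary by country).

```python
import re

# Illustrative patterns only; each maps a label to a compiled regex.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
    "BLOOD_GROUP": re.compile(r"\b(?:AB|A|B|O)[+-](?!\w)"),
}


def mask_patterns(text: str, patterns: dict = PATTERNS) -> str:
    """Replace every match of each configured pattern with its label."""
    for label, pattern in patterns.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Contact jane.doe@example.org or +34 600 123 456. Blood group: AB+."
print(mask_patterns(sample))
```

Keeping the patterns in a dictionary means a deployment can add, remove or override entries without touching the masking logic.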

User-definable dictionaries for particular applications will also cater for specific usages of entity names known in advance.
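
Dictionary-based matching of names known in advance could be sketched as below. The function names, the CUSTOM label and the sample terms are hypothetical, and case-insensitive whole-word matching is an assumption chosen for the example rather than a documented MAPA behavior.

```python
import re


def build_dictionary_matcher(terms: list[str]) -> re.Pattern:
    """Compile one regex that matches any of the user-supplied terms."""
    if not terms:
        raise ValueError("the dictionary must contain at least one term")
    # Longest terms first, so "Jane Doe" is preferred over its prefix "Jane".
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def mask_terms(text: str, matcher: re.Pattern, label: str = "CUSTOM") -> str:
    """Replace every dictionary hit with a bracketed label."""
    return matcher.sub(f"[{label}]", text)


matcher = build_dictionary_matcher(["Jane", "Jane Doe"])
print(mask_terms("Statement by jane doe, witness.", matcher))
```

Sorting by length matters because regex alternation takes the first alternative that matches; with the shorter term first, only "jane" would be masked and the surname would leak.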

Use cases

MAPA includes specific deployments (use cases) for public institutions in several EU countries, covering the health domain and the legal domain. Both domains were selected because of their strong anonymization requirements prior to any publication or sharing of data. In each deployment case, the system will be tailored to the specific needs of the relevant institution.

MAPA is funded by the Connecting Europe Facility (CEF) program, under grant No A2019/1927065, and will run from January 2020 until December 2021.