Unpacking Multilingual Data Management and Anonymization

During a panel discussion at SlatorCon Remote September 2022, Pangeanic CEO Manuel Herranz and CTO Amando Estela discussed the company’s evolution and its projects around data anonymization, pseudonymization, and data masking.

Pangeanic started as a Japanese company’s European branch, which was acquired by Herranz in the early 2000s. Initially, the company operated as a language services provider (LSP), focusing its technology R&D efforts on statistical machine translation. Machine translation (MT) is still one of the company’s strongest offerings today.

Herranz encouraged attendees to listen to SlatorPod #43 to learn more about the origins of Pangeanic. Since its early days, the company has developed several tools that enhance its existing MT product, becoming a highly specialized natural language processing (NLP) company focused on data processing.

Complying With GDPR Via Data Anonymization

Herranz and Estela discussed the versatility of proprietary NLP solutions, such as MT and automatic data classification and de-identification, which allow companies to leverage data while remaining compliant with regulations, including the General Data Protection Regulation (GDPR).

Pangeanic is a partner in the European Union’s MAPA project (Multilingual Anonymization for Public Administrations). MAPA encompasses all official EU languages, and Pangeanic has contributed software tools that remove identifying properties from data, including personal information such as people’s names.

Slator Commercial Director, Andrew Smart, asked the Pangeanic experts about some of the primary industries that would take advantage of these tools. Herranz said that, although the legal and financial sectors remain key users, the largest clients for data anonymization are utility companies and business-to-consumer (B2C) firms.

The Pangeanic CEO explained that B2C companies “hold a lot of personal traceable data that can be reutilized or monetized,” including travel or transportation preferences and eating habits. Once anonymized, the data becomes GDPR-compliant and can be used in myriad ways.

CTO Estela added, “Anonymization is some kind of translation, because you are translating from English into GDPR-compliant English.”

Violating GDPR Can Result in Fines

Herranz noted that different users have different preferences and needs when it comes to data anonymization techniques, such as data masking and data swapping. One example of data masking, he said, is using an average age or age range instead of a specific number. Word replacement techniques, such as data swapping and pseudonymization, could instead substitute a city or a person’s name.
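The two techniques Herranz described can be illustrated with a minimal Python sketch. This is not Pangeanic's implementation; the function names `mask_age` and `pseudonymize`, the ten-year bucket width, and the sample record are all assumptions made up for illustration.

```python
from typing import Dict


def mask_age(age: int, width: int = 10) -> str:
    """Data masking: replace an exact age with the range it falls in."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def pseudonymize(text: str, replacements: Dict[str, str]) -> str:
    """Data swapping: replace each known identifier with a substitute."""
    for original, substitute in replacements.items():
        text = text.replace(original, substitute)
    return text


# Hypothetical customer record, invented for this example.
record = {"name": "Maria Lopez", "city": "Valencia", "age": 37}
swaps = {"Maria Lopez": "Person A", "Valencia": "City X"}

masked_record = {
    "name": pseudonymize(record["name"], swaps),
    "city": pseudonymize(record["city"], swaps),
    "age": mask_age(record["age"]),
}
print(masked_record)
```

In this sketch the replacement table is supplied by hand. A production system would first have to detect which words are names or cities.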

He also explained how some companies can violate GDPR by confusing anonymization with data hashing (that is, mapping data of an arbitrary size to data of a fixed size by using a hash function) or with creating levels of access. Neither method provides GDPR compliance because identifying data would still be visible. Those companies could face fines in the millions of euros, according to Herranz.
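One commonly cited weakness of plain hashing, offered here as an illustration rather than as Herranz's own argument, is that an unsalted hash is deterministic: the same input always produces the same digest. Anyone who can guess candidate values can hash them and match the results against stored digests. The sketch below uses Python's standard `hashlib`; the helper names and the email addresses are hypothetical.

```python
import hashlib
from typing import Iterable, Optional


def sha256_hex(value: str) -> str:
    """Map a string of any length to a fixed-size (64 hex character) digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def reidentify(hashed: str, candidates: Iterable[str]) -> Optional[str]:
    """Recover the original value by hashing guesses and comparing digests."""
    lookup = {sha256_hex(candidate): candidate for candidate in candidates}
    return lookup.get(hashed)


# A "hashed" identifier as it might sit in a database (invented value).
stored = sha256_hex("maria.lopez@example.com")

# An attacker's list of plausible addresses (invented values).
guesses = ["john.smith@example.com", "maria.lopez@example.com"]

recovered = reidentify(stored, guesses)
print(recovered)
```

Because the digest can be matched back to a person whenever the input is guessable, hashing in this form behaves more like a stable pseudonym than like anonymization.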

Multilingual Data Management 

Anonymization extends to translation databases, which could be legally sold as compliant translation memories. On the question of anonymization for low-resource languages, CTO Estela pointed out how “translation requires something like 30 million examples in order to train an engine. [But for low-resource languages,] we need something like 300,000.” Sizing the data is, thus, key to approaching the anonymization process.

The Pangeanic CTO went on to say that engines are not trained to anonymize; rather they are designed to detect that something is a name and something else is an address.
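The split Estela described, detection first and anonymization as a separate step, can be sketched as follows. The regular expressions below are a toy stand-in for a trained detection engine and are an assumption of this example, as are the labels `EMAIL` and `PHONE` and the function names.

```python
import re
from typing import List, Tuple

# Toy detectors. A real engine would be a trained model, not regexes.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def detect(text: str) -> List[Tuple[int, int, str]]:
    """Step 1: report (start, end, label) for each sensitive span found."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def redact(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Step 2: replace each detected span with its label (spans must not overlap)."""
    pieces = []
    cursor = 0
    for start, end, label in spans:
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


sample = "Contact Ana at ana@example.com or +34 600 123 456."
found = detect(sample)
print(redact(sample, found))
```

The name "Ana" passes through untouched here, since patterns like these cannot recognize names. That gap is the part a trained detection engine is meant to fill.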

The panel discussion closed with other types of data that can be anonymized and monetized, such as conversations. Herranz said, “All those conversations have value […] and have to be anonymized as a legal requirement.”

“There isn’t a compliance team anywhere in the world that hasn’t thought about anonymization or some kind of data masking of personal identifiable data. There are ways in which data can be shared safely […] it is a gold mine,” the Pangeanic CEO concluded.