PangeaMT Reaches 1.2 Billion Bilingual Aligned Sentences as Training Data


Data has become a key input for driving growth, enabling businesses to differentiate themselves and maintain a competitive edge. It is now at the forefront of most businesses’ Research and Development. Given the growing importance of data, in 2016 Pangeanic created the Corpora Task-Force to mine language resources and to create the “Pangea Corpus”, a multilingual, multiple-aligned language repository.

Last month the Pangea Corpus reached the 1.2 billion mark for its translation alignments.

We asked Amando Estela, head of the Corpora Task-Force, about their work so far.

What exactly is the Corpora Task-Force?
At Pangeanic we specialize in Artificial Intelligence applied to language processing, mostly using neural networks, for translation and also for monolingual technologies. Nowadays, deep learning is considered a mature technology. The processes involved have been well understood at a theoretical level for some time, but its biggest weakness tends to be the lack of enough clean, reliable data to train neural network engines. The Task-Force was created with one simple goal: to acquire as many language resources as possible, whether monolingual or multilingual.

Why is it important for Pangeanic to have a large language corpus?
Clean data is the “raw material” that the AI system consumes during its training. As a general rule, the higher the quality and the larger the amount of data used during training, the better the output of the engine.

We never have enough data because new uses for AI appear every day, be they new translation engines for under-resourced languages, specialized (in-domain) translation engines for a particular field with specific terminology, or monolingual AIs with many uses such as summarization, style correction or information extraction, to name a few.

How large is the Corpus so far?
Pangeanic has reached the impressive mark of 1.2 billion translation alignments, and the automated system we’ve put in place is currently acquiring some 3M new alignments per day.

For every language pair we acquire, at least 20M alignments have no specific domain. We also acquire resources for the main language modes (vernacular, assertive, formal, …) and variations (dialects).

How has Pangeanic reached this milestone?
The bulk of the corpus is acquired by crawling or mining open source repositories. We’ve set up a farm of servers (up to 50) to crawl the repositories and to try to establish the alignments. Because we want clean data, every candidate alignment is submitted to an NLP service that checks the quality of the alignment, normalizes it and eventually anonymizes it. The NLP service basically works as a filter, letting only 25% of the candidate alignments into the corpus.
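As a rough illustration of such a filter, here is a minimal Python sketch. It is not Pangeanic's NLP service, whose internals are not described in this interview. The function names (normalize, looks_aligned, filter_alignments) and the length-ratio heuristic are assumptions chosen only to show the shape of a "normalize, check, keep or discard" step.

```python
import unicodedata


def normalize(segment: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", segment).split())


def looks_aligned(source: str, target: str, max_ratio: float = 2.0) -> bool:
    """Crude stand-in for a quality check (an assumption, not the real service):
    reject empty or identical segments and pairs whose lengths differ too much."""
    if not source or not target or source == target:
        return False
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    return longer / shorter <= max_ratio


def filter_alignments(candidates):
    """Yield normalized (source, target) pairs that pass the check."""
    for source, target in candidates:
        source, target = normalize(source), normalize(target)
        if looks_aligned(source, target):
            yield source, target


# Hypothetical candidate pairs, as a crawler might produce them.
candidates = [
    ("The  contract is signed.", "El contrato está firmado."),
    ("Yes.", "Este segmento es demasiado largo para ser una traducción."),
    ("Menu", "Menu"),
    ("", "Texto sin origen."),
]
kept = list(filter_alignments(candidates))
print(f"kept {len(kept)} of {len(candidates)} candidates")
```

With these four hypothetical candidates only the first pair survives. A production filter would rely on much stronger signals than segment length.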

More specific alignments have been acquired from internal Pangeanic resources (client donations, data organizations and public repositories from the EU, the UN and UN agencies, the European Central Bank and national institutes of statistics).

How has Pangeanic compiled this data in relation to recent GDPR legislation?
All data is aggressively anonymized unless the repository storing it marks the data as fully open and reusable. Aggressive anonymization means that acquired data is stripped of all tokens (names, entities, dates, numbers, addresses, …) that may hint at the original source or convey personal or private data. Context is also lost, because we work at segment level. As a final measure, we completely reject any segment for which we have less than 100% certainty of anonymity.
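To make the idea of token stripping concrete, here is a small Python sketch that masks a few easily recognized token types with placeholders. The patterns, the placeholder names (<EMAIL>, <DATE>, <NUM>) and the anonymize function are assumptions for illustration only, not the rules Pangeanic applies. Names and other entities would need a named-entity recognizer, which is not shown here.

```python
import re

# Order matters: emails and dates are masked before bare numbers,
# otherwise the digits inside a date would be replaced one group at a time.
PATTERNS = [
    ("<EMAIL>", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("<DATE>", re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b")),
    ("<NUM>", re.compile(r"\d+(?:[.,]\d+)*")),
]


def anonymize(segment: str) -> str:
    """Replace every match of each pattern with its placeholder."""
    for placeholder, pattern in PATTERNS:
        segment = pattern.sub(placeholder, segment)
    return segment


print(anonymize("Invoice 4521 was sent to ana.lopez@example.com on 12/03/2018."))
# prints: Invoice <NUM> was sent to <EMAIL> on <DATE>.
```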

Why would you say that language data is important in this day and age?
Data, in general, is a valuable asset. Increasingly, companies are valued by the data that they own. We expect to see data acquisition efforts not only by companies, but also by official bodies and open source organizations at national or multinational level.

What plans does Pangeanic have for its compilation of language data?
This is going to be a continuous effort. Once we reach 20-30M alignments for the 50 most widely spoken human languages, starting with English, we will begin acquiring data in specific domains (legal, film dialogue, life sciences, energy, engineering…). We also plan to generate synthetic alignments by triangulating the corpus, which should produce 10 to 100 times more indirect data.
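One common way to triangulate a corpus is to pivot through a shared language: if an English segment is aligned with both a Spanish and a French segment, the Spanish and French segments can be paired as a synthetic alignment. The Python sketch below shows that idea under stated assumptions. The interview does not say how Pangeanic triangulates, and the triangulate function, the exact-match join and the sample sentences are all hypothetical.

```python
from collections import defaultdict


def triangulate(pivot_to_a, pivot_to_b):
    """Pair A and B segments that share an identical pivot segment.

    Both arguments are iterables of (pivot_segment, other_segment) tuples.
    Returns a set of synthetic (a_segment, b_segment) alignments.
    """
    a_by_pivot = defaultdict(set)
    for pivot, a in pivot_to_a:
        a_by_pivot[pivot].add(a)

    synthetic = set()
    for pivot, b in pivot_to_b:
        for a in a_by_pivot.get(pivot, ()):
            synthetic.add((a, b))
    return synthetic


# Hypothetical English-pivot data.
en_es = [("Good morning.", "Buenos días."), ("Thank you.", "Gracias.")]
en_fr = [("Good morning.", "Bonjour."), ("See you soon.", "À bientôt.")]

es_fr = triangulate(en_es, en_fr)
print(es_fr)
```

Only "Good morning." appears on both sides, so a single Spanish-French pair is produced. Because every pivot segment can link several languages to one another, the number of indirect pairs can grow far beyond the number of direct alignments.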

All that data… it seems like a huge amount. How do you organize and manage it?
We use ActivaTM, which works as a large-scale translation memory database capable of storing data in monolingual or multilingual format. In a recent EU contract, it was selected as the database of choice for EU Member States’ national language repositories coming from public translation contracts.

What is in store for the future for language data and AI?
I believe that we are at the dawn of a new era with AI. You only have to look at recent advances in language processing and the importance of some language-based companies in Asia and the US. Every day, new specialised hardware appears that makes data gathering, selection by domain, auto-categorization, training and cleaning faster and more readily available. Language is an essential communication tool used by all people. AI and neural systems not only help to bridge language gaps; they also help to extract knowledge from millions of inputs, find trends and preferences and, soon, use those millions of aligned sentences to make predictions.