PangeaMT Reaches 1.2 Billion Bilingual Aligned Sentences as Training Data


November 27, 2018

Press Releases · by Pangeanic


Data has become a key input for driving growth, enabling businesses to differentiate themselves and maintain a competitive edge. It is now at the forefront of most businesses’ Research and Development. Given this growing importance, in 2016 Pangeanic created the Corpora Task Force to mine language resources and build the “Pangea Corpus”, a multilingual, multi-aligned language repository.

Last month the Pangea Corpus reached the 1,200,000,000 mark for its translation alignments.

We asked Amando Estela, head of the Corpora Task Force, about the team's work so far.


What exactly is the Corpora Task Force?
At Pangeanic we specialize in Artificial Intelligence applied to language processing, mostly with neural networks, covering translation as well as monolingual technologies. Deep learning is now considered a mature technology. The processes involved have been well understood at a theoretical level for some time, but its biggest weakness tends to be the lack of enough clean, reliable data to train neural network engines. The Task Force was created with one simple goal: to acquire as many language resources as possible, whether monolingual or multilingual.

What is the importance for Pangeanic to have a large language corpus?
Clean data is the “raw material” that the AI system consumes during its training. As a general rule, the higher the quality and the larger the amount of data used in training, the better the output of the engine.

We never have enough data because new uses for AI appear every day: new translation engines for under-resourced languages, specialized (in-domain) translation engines for a particular field with specific terminology, or monolingual AIs with many uses such as summarization, style correction or information extraction, to name a few.

How large is the Corpus so far?
Pangeanic has reached the impressive mark of 1.2 billion (1,200,000,000) translation alignments, and the automated system we have put in place is currently acquiring some 3M new alignments per day.
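To put those two figures in proportion, here is a back-of-the-envelope sketch. It assumes the acquisition rate stays constant at 3M alignments per day, which the interview does not claim; the function name `days_to_reach` is ours, for illustration only.

```python
CORPUS_SIZE = 1_200_000_000  # alignments reported in the interview
DAILY_RATE = 3_000_000       # new alignments per day reported in the interview


def days_to_reach(target, current=CORPUS_SIZE, per_day=DAILY_RATE):
    """Days needed to grow from `current` to `target` at a constant daily rate (an assumption)."""
    remaining = max(0, target - current)
    return -(-remaining // per_day)  # ceiling division on integers


# Doubling the corpus at the current pace would take about 400 days.
print(days_to_reach(2 * CORPUS_SIZE))
```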

For every language pair we acquire, at least 20M alignments have no specific domain. We also acquire resources for the main language modes (vernacular, assertive, formal, …) and variations (dialects).

How has Pangeanic reached this milestone?
The bulk of the corpus is acquired by crawling or mining open source repositories. We have set up a farm of up to 50 servers to crawl the repositories and try to establish the alignments. Because we want clean data, every candidate alignment is submitted to an NLP service that checks the quality of the alignment, normalizes it and, where required, anonymizes it. The NLP service basically works as a filter, letting only 25% of the alignments into the corpus.
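The interview does not describe how the NLP service decides which alignments to keep. As a rough illustration only, a filter of this kind could be sketched as follows. The checks used here (empty sides, untranslated copies, a length-ratio threshold) and every name in the code are our assumptions, not Pangeanic's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    source: str
    target: str


def looks_aligned(pair: Alignment, max_ratio: float = 2.0) -> bool:
    """Hypothetical quality check: reject empty, untranslated or length-mismatched pairs."""
    src, tgt = pair.source.strip(), pair.target.strip()
    if not src or not tgt:
        return False
    if src.lower() == tgt.lower():
        return False  # identical sides usually mean the text was never translated
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio


def normalize(pair: Alignment) -> Alignment:
    """Collapse runs of whitespace on both sides."""
    return Alignment(" ".join(pair.source.split()), " ".join(pair.target.split()))


def filter_candidates(candidates):
    """Return the normalized alignments that pass the check, plus the acceptance rate."""
    kept = [normalize(c) for c in candidates if looks_aligned(c)]
    rate = len(kept) / len(candidates) if candidates else 0.0
    return kept, rate


candidates = [
    Alignment("The  house is red.", "La casa es roja."),
    Alignment("Hello", ""),
    Alignment("Click here", "Click here"),
    Alignment("Yes", "Sí, por supuesto que sí, sin ninguna duda"),
]
kept, rate = filter_candidates(candidates)
print(len(kept), rate)
```

A production filter would rely on far stronger signals, such as language identification or cross-lingual similarity scores. The sketch only shows the shape of a keep-or-reject pipeline with an acceptance rate.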

More specific alignments have been acquired from internal Pangeanic resources (client donations, data organizations and public repositories from the EU, the UN and UN agencies, the European Central Bank and national institutes of statistics).

How has Pangeanic compiled this data in relation to recent GDPR legislation?
All data is aggressively anonymized unless the repository storing it marks the data as fully open and reusable. Aggressive anonymization means that acquired data is stripped of all tokens (names, entities, dates, numbers, addresses, …) which may hint at the original source or convey personal or private data. Context is also lost because we work at segment level. As a final measure, we completely reject any segment for which anonymity is less than 100% certain.
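The strip-then-reject policy described above can be illustrated with a minimal sketch. It is not Pangeanic's anonymizer: the regular expressions, the placeholder tags and the capitalization heuristic are all assumptions made for this example, and a real system would presumably use named-entity recognition rather than patterns alone.

```python
import re

# Illustrative patterns only; the placeholder tags are hypothetical.
URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
NUMBER = re.compile(r"\d+(?:[.,/:-]\d+)*")  # also covers simple dates such as 12/03/2018
# A capitalized word that is not the first token is treated as a possible name.
POSSIBLE_NAME = re.compile(r"(?<=\s)[A-Z][a-z]+")


def anonymize(segment: str):
    """Mask URLs, e-mail addresses and numbers; return None if anonymity is still uncertain."""
    text = URL.sub("<URL>", segment)
    text = EMAIL.sub("<EMAIL>", text)
    text = NUMBER.sub("<NUM>", text)
    if POSSIBLE_NAME.search(text):
        return None  # reject the whole segment rather than risk leaking a name
    return text


print(anonymize("The invoice of 12/03/2018 totals 1,250.50 euros."))
print(anonymize("Please contact Maria at maria@example.com"))
```

The capitalization heuristic is deliberately conservative and biased towards English: it would reject almost every German segment, for instance. That mirrors the stated policy of discarding anything that is not certainly anonymous, at the cost of throwing away usable data.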

Why would you say that language data is important in this day and age?
Data, in general, is a valuable asset. Increasingly, companies are valued by the data they own. We expect to see data acquisition efforts not only by companies, but also by official bodies and open source organizations at national or multinational level.

What plans does Pangeanic have for its compilation of language data?
This is going to be a continuous effort. Once we reach 20-30M alignments for each of the 50 most widely spoken human languages, starting with English, we will begin acquiring data in specific domains (legal, film dialogue, life sciences, energy, engineering…). We also plan to generate synthetic alignments by triangulating the corpus, which should yield 10 to 100 times as much indirect data.
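Triangulation is not detailed in the interview. One common reading is pivoting: two corpora that share a language are joined on identical segments in that language to produce pairs between the other two. The sketch below assumes English as the pivot and exact string matches; both are our assumptions, and the function name `triangulate` is hypothetical.

```python
from collections import defaultdict


def triangulate(pivot_to_a, pivot_to_b):
    """Join two (pivot, translation) corpora on identical pivot segments.

    Returns synthetic (a, b) pairs for every pivot segment present in both corpora.
    """
    by_pivot = defaultdict(list)
    for pivot, a in pivot_to_a:
        by_pivot[pivot].append(a)

    synthetic = []
    for pivot, b in pivot_to_b:
        for a in by_pivot.get(pivot, []):
            synthetic.append((a, b))
    return synthetic


en_es = [("Good morning", "Buenos días"), ("Thank you", "Gracias")]
en_fr = [("Good morning", "Bonjour"), ("See you soon", "À bientôt")]

# Only "Good morning" appears in both corpora, so one Spanish-French pair results.
print(triangulate(en_es, en_fr))
```

With many languages aligned to the same pivot, the number of indirect pairs grows with the number of language combinations, which is one plausible way to read the 10 to 100 times figure.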

All that data… it seems like a huge amount. How do you organize and manage it?
We use ActivaTM, which works as a large-scale translation memory database capable of storing data in monolingual or multilingual format. In a recent EU contract, it was selected as the database of choice for EU Member States’ national language repositories, which are fed by public translation contracts.

What is in store for the future for language data and AI?
I believe that we are at the dawn of a new era with AI. You only have to look at recent advances in language processing and the importance of some language-based companies in Asia and the US. Every day, new specialised hardware appears that makes data gathering, selection by domain, auto-categorization, training and cleaning faster and more readily available. Language is an essential communication tool used by all people. AI and neural systems not only help to bridge language gaps; they also help to extract knowledge from millions of inputs, find trends and preferences and, soon, use those millions of aligned sentences to predict.

By Pangeanic

As a technology and translation company, Pangeanic specialises in the automation of as many language processes as possible, serving cross-national institutions, multinationals and government agencies all over the world. With nearly 2 decades of experience in supplying translation services in over 100 languages, we know what works for multilingual publication.
