Pangeanic: Over 10Bn Alignments for Machine Learning in 84 Languages

June 19, 2020

Press Releases ·

by Pangeanic

Language technology and human translation firm Pangeanic announced today that it has passed the mark of 10 billion aligned data segments across 84 languages, propelling the company forward in its mission to build and train new machine learning technologies.

The company reached a new milestone last week when it confirmed it had successfully clocked the 10,200,054th segment, boosting its research and development capabilities for machine translation and Natural Language Processing (NLP) technologies.

Manuel Herranz, Pangeanic CEO, stated: “In an increasingly data-centric society, the value of companies is often derived from the quality of the data they manage, structure and produce. In order to be cutting-edge in machine translation, and in many other NLP disciplines, the value of human-approved data is essential. The best algorithm is worthless if it does not have millions of segments to learn from. Our automated data acquisition pipelines make our repositories a goldmine for data scientists.”

Pangeanic has carved out a name for itself in the language technology space by developing cutting-edge algorithms, infrastructures and toolkits, as well as by leading data-focused European projects, most recently spearheading its 2020 Europe-wide anonymization project, built with state-of-the-art NLP tools.

Pangeanic and its sister division PangeaMT have gathered and trained on a diversified pool of data from different sources, including open-source data, human-produced data, anonymized data from public sources, data crawled from websites, and even near-human, highly scalable in-domain synthetic data.

Pangeanic’s Chief Research Scientist Mercedes Garcia said: “Having reached this milestone is a great step forward for us because it means that we can automatically obtain high-quality translations in many languages and domains.”

“Machine learning is an area of AI where data is the basic ingredient. Without data you can’t generate or build an automatic model or system. This is really the value of the company, having access to all this data.”

Pangeanic’s tech team uses this rich bank of data to train AI algorithms that partners, companies and institutions can benefit from. NTEU, the company’s recent European Commission-funded project, sees Pangeanic implementing Automatic Translation across Member States’ Public Administrations.

NTEU, along with other Pangeanic projects, is based on neural machine translation engines that require large volumes of quality data, which the company collects daily to build a proprietary data repository.

Ms Garcia said: “Neural networks imitate the behavior of a brain. Therefore, large amounts of data along with examples of sentences or segments are needed when training a neural model.” 

“Models based on machine learning learn by examples fed to them through data collected in datasets. Good results rely on high quality data, and domain specific data for particular applications.”

She explained that Pangeanic’s data reaches high quality after it is selected and rigorously cleaned by the team, then edited by expert in-house translators who maintain and improve the data to obtain “really near human results… sometimes scaringly human-like!”

The company also boasts a large archive of in-domain data: specialised data for defined areas such as finance, banking, robotics, dialogs, social media and entertainment, medicine and law.

Ms Garcia said: “Acquiring in-domain data is extremely important in order to produce quality translations in specific areas. For example, having quality medical data is crucial for us to develop automatic systems for that specific field.”

“This is part of our competitive advantage: we are specialized in adapting systems to specific areas.”

Pangeanic’s Programmer and Data Analyst Alex Kohan agreed, noting that Pangeanic’s ability to adapt language to different fields extends to adapting language styles and variants.

He said: “If we would like Portuguese to sound more Brazilian for example, then we can build processes to adapt the data by including Brazilian-specific data.”

Mr Kohan said Pangeanic’s training segments also include data for under-resourced languages, which the company’s team built in-house through automatic data-gathering processes.

He said: “You need voluminous amounts of data samples to obtain quality machine translation, because when you clean data you may lose several thousands of segments. Although a percentage of some stock data may come from open repositories, it is usually not trustworthy because of the noise it may contain. Having the assurance and confidence that a dataset is X% reliable adds to better processes.”

“We build synthetic data where there is less source data available. This occurs when working with under-resourced languages such as Maltese or Irish Gaelic, as there is less original data available on the internet.”
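
The cleaning step Mr Kohan describes, in which noisy segments are discarded and the dataset shrinks, can be illustrated with a minimal Python sketch. This is not Pangeanic's pipeline: the `Segment` type, the `clean_segments` function, the four filters and the 3.0 length-ratio threshold are all assumptions chosen for illustration, and the sample sentences are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One aligned pair: a source sentence and its translation (hypothetical type)."""
    source: str
    target: str


def clean_segments(segments, max_length_ratio=3.0):
    """Return (kept, dropped_count) after applying simple noise filters.

    The filters and the default 3.0 ratio are illustrative assumptions,
    not a description of any vendor's actual process.
    """
    kept = []
    seen = set()
    for seg in segments:
        src, tgt = seg.source.strip(), seg.target.strip()
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        # Drop untranslated pairs (target identical to source).
        if src.lower() == tgt.lower():
            continue
        # Drop pairs whose lengths differ too much to be plausible translations.
        longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
        if longer / shorter > max_length_ratio:
            continue
        # Drop exact duplicates of a pair already kept.
        key = (src, tgt)
        if key in seen:
            continue
        seen.add(key)
        kept.append(Segment(src, tgt))
    return kept, len(segments) - len(kept)


raw = [
    Segment("Good morning", "Bom dia"),
    Segment("Good morning", "Bom dia"),           # duplicate
    Segment("Terms and conditions", ""),          # empty target
    Segment("Login", "Login"),                    # untranslated
    Segment("Yes", "Este documento descreve o processo completo"),  # length mismatch
    Segment("Thank you very much", "Muito obrigado"),
]
kept, dropped = clean_segments(raw)
share_kept = len(kept) / len(raw)
print(f"kept {len(kept)}, dropped {dropped}, share kept {share_kept:.0%}")
```

Even on this toy input most pairs are discarded, which is why a large raw volume is needed to end up with a usable clean set.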

Mr Kohan said that, beyond collecting language data, Pangeanic will expand its data-gathering efforts in 2020 by widening the range of data it collects to train AI-based systems.

He said: “We have some exciting projects coming up. We’re looking at collecting different types of data, like voice for speech language translation, and pictures and videos to automatically categorize them on a large scale.”

As a technology and translation company, Pangeanic specialises in automating as many language processes as possible, serving cross-national institutions, multinationals and government agencies all over the world. With nearly two decades of experience supplying translation services in over 100 languages, we know what works for multilingual publication.

© 2021 Slator. All rights reserved.
