Language technology and human translation firm Pangeanic announced today that it has crossed the mark of 10 billion aligned data segments in 84 languages, advancing its mission to build and train new machine learning technologies.
The company reached the milestone last week when it confirmed it had logged its 10,200,054th segment, boosting its research and development capabilities for machine translation and Natural Language Processing (NLP) technologies.
Manuel Herranz, Pangeanic CEO, stated: “In an increasingly data-centric society, the value of companies is often derived from the quality of the data they manage, structure and produce. To be cutting-edge in machine translation, and in many other NLP disciplines, human-approved data is essential. The best algorithm is worthless if it does not have millions of segments to learn from. Our automated data acquisition pipelines make our repositories a goldmine for data scientists.”
Pangeanic has carved out a name for itself in the language technology space by developing cutting-edge algorithms, infrastructures and toolkits, as well as by leading data-focused European projects, most recently spearheading its 2020 Europe-wide anonymization project, built with state-of-the-art NLP tools.
Pangeanic and its sister division PangeaMT have gathered, and trained their systems on, a diverse pool of data from different sources, including open source data, human-produced data, anonymized data from public sources, data crawled from websites, and even near-human, highly scalable in-domain synthetic data.
Pangeanic’s Chief Research Scientist Mercedes Garcia said: “Reaching this milestone is a great step forward for us because it means that we can automatically obtain high-quality translations in many languages and domains.”
“Machine learning is an area of AI where data is the basic ingredient. Without data you can’t generate or build an automatic model or system. This is really the value of the company, having access to all this data.”
Pangeanic’s tech team uses this rich bank of data to train AI algorithms that partners, companies and institutions can benefit from. NTEU, the company’s recent European Commission-funded project, sees Pangeanic implementing Automatic Translation across Member States’ Public Administrations.
NTEU, along with other Pangeanic projects, is based on neural machine translation engines that require large volumes of quality data, which the company gathers daily to build a proprietary data repository.
Ms Garcia said: “Neural networks imitate the behavior of a brain. Therefore, large amounts of data along with examples of sentences or segments are needed when training a neural model.”
“Models based on machine learning learn from examples fed to them through data collected in datasets. Good results rely on high-quality data, and on domain-specific data for particular applications.”
She explained that Pangeanic’s data reaches high quality after it is selected and rigorously cleaned by the team, then edited by expert in-house translators who maintain and improve it to obtain “really near human results… sometimes scaringly human-like!”
The company also boasts a large archive of in-domain data: specialized data for defined areas such as finance, banking, robotics, dialogs, social media, entertainment, medicine and law.
Ms Garcia said: “Acquiring in-domain data is extremely important in order to produce quality translations in specific areas. For example, having quality medical data is crucial for us to develop automatic systems for that specific field.”
“This is part of our competitive advantage: we specialize in adapting systems to specific areas.”
Pangeanic’s Programmer and Data Analyst Alex Kohan agreed, noting that Pangeanic’s ability to adapt language to different fields also extends to language styles and variants.
He said: “If we would like Portuguese to sound more Brazilian, for example, then we can build processes to adapt the data by including Brazilian-specific data.”
Mr Kohan added that Pangeanic’s trained segments also cover under-resourced languages, using data the company’s team built in-house through automatic data-gathering processes.
He said: “You need voluminous amounts of data samples to obtain quality machine translation, because when you clean data you may lose several thousands of segments. Although a percentage of some stock data may come from open repositories, it is usually not trustworthy because of the noise it may contain. Having the assurance and confidence that a dataset is X% reliable makes for better processes.”
“We build synthetic data where there is less source data available. This occurs when working with under-resourced languages such as Maltese or Irish Gaelic, as there is less original data available on the internet.”
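As a rough illustration of why cleaning shrinks a dataset, the Python sketch below filters a list of aligned segment pairs. It is not Pangeanic’s pipeline: the function name `clean_segments`, the `max_len_ratio` threshold and the three filters (empty sides, extreme length mismatch, duplicates) are assumptions chosen only to show the idea.

```python
from typing import Iterable, List, Tuple

# An aligned segment: (source sentence, target sentence).
Segment = Tuple[str, str]


def clean_segments(pairs: Iterable[Segment],
                   max_len_ratio: float = 3.0) -> List[Segment]:
    """Drop obviously noisy aligned segments (hypothetical heuristics)."""
    seen = set()
    kept: List[Segment] = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Filter 1: one side of the pair is empty.
        if not src or not tgt:
            continue
        # Filter 2: lengths differ so much that the pair is
        # unlikely to be a real translation.
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_len_ratio:
            continue
        # Filter 3: exact duplicates, ignoring case.
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((src, tgt))
    return kept


raw = [
    ("Hello world", "Hola mundo"),
    ("Hello world", "Hola mundo"),                         # duplicate
    ("", "Hola"),                                          # empty source
    ("Yes", "Sí, por supuesto que sí, sin ninguna duda"),  # length mismatch
]
cleaned = clean_segments(raw)
print(f"kept {len(cleaned)} of {len(raw)} segments")
```

Even on this toy input most pairs are discarded, which is why a large raw volume is needed to end up with a usable clean set.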
Mr Kohan said that, aside from collecting language data, Pangeanic will focus on expanding its data-gathering efforts in 2020 by widening the range of data it collects to train AI-based systems.
He said: “We have some exciting projects coming up. We’re looking at collecting different types of data, like voice for speech translation, and pictures and videos to automatically categorize them on a large scale.”