Here’s a ‘Brand-New’ Massive Multilingual Dataset for Machine Translation

In a March 20, 2024 paper, a team of researchers from the University of Helsinki, the University of Edinburgh, the University of Oslo, the University of Turku, and Prompsit introduced a new massive multilingual dataset for language modeling and machine translation (MT) training.

This dataset, known as the HPLT (High Performance Language Technologies) language resources, comprises both monolingual and bilingual corpora.

The researchers highlighted that the dataset is unique because it is “brand-new”: it is sourced from web crawls provided by the Internet Archive, a non-profit digital library of millions of free books, movies, software, music, and websites, and by Common Crawl, a free, open repository of web crawl data. This marks the first time these resources have been used at such a large scale to create multilingual text corpora.

“The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training,” they said.

The researchers detailed the methods used for acquiring, managing, and processing large corpora, which rely on open-source software tools and high-performance computing. 

They have also made the corpora, software, and tools publicly available on GitHub, with the aim of setting “an example for others inside and outside the research community.”

The monolingual collection covers 75 languages, both high- and low-resourced, with a particular emphasis on low- to medium-resourced ones, and totals 5.25 billion documents.

The parallel corpus is English-centric and includes 18 language pairs with over 96 million aligned sentence pairs. The researchers highlighted that the dataset emphasizes low-resource languages, aiming to enhance the availability of parallel data for MT development.

Furthermore, the researchers generated a synthetic dataset by pivoting their existing parallel datasets through English. This synthetic dataset encompasses 171 language pairs and comprises 157 million sentence pairs.
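The pivoting idea can be sketched as follows: two non-English sentences are paired whenever they are both aligned to the same English sentence. This is a minimal illustration of the general technique, not the project's actual tooling, and the data layout is an assumption made for the example.

```python
from collections import defaultdict
from itertools import combinations

def pivot_through_english(parallel):
    """Create synthetic xx-yy sentence pairs from English-centric data.

    `parallel` maps a language code to a list of (english, foreign)
    sentence pairs. Two foreign sentences are paired whenever they are
    aligned to the same English sentence (the pivot).
    """
    # Index each corpus by its English side for fast joining.
    by_english = {lang: dict(pairs) for lang, pairs in parallel.items()}

    synthetic = defaultdict(list)
    for src, tgt in combinations(sorted(by_english), 2):
        # English sentences that occur in both corpora act as pivots.
        shared = by_english[src].keys() & by_english[tgt].keys()
        for en in shared:
            synthetic[(src, tgt)].append(
                (by_english[src][en], by_english[tgt][en])
            )
    return dict(synthetic)
```

With English-centric corpora for, say, Finnish and Norwegian, this yields a synthetic fi-no pair wherever the two corpora share an English sentence, which is how pivoting multiplies a handful of English-centric pairs into many more language combinations.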

The datasets also contain metadata, “which can be employed by end users to conduct their own filtering,” according to the researchers.
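Such metadata-driven filtering might look like the sketch below. The field names (`lang_score`, `length`) and thresholds are illustrative assumptions, not the actual HPLT metadata schema.

```python
def filter_by_metadata(documents, min_lang_score=0.8, min_length=50):
    """Keep only documents whose metadata meets the given thresholds.

    `documents` is a list of dicts with per-document metadata; the
    field names used here are hypothetical examples of the kind of
    signals (language-ID confidence, document length) end users could
    filter on.
    """
    return [
        doc for doc in documents
        if doc.get("lang_score", 0.0) >= min_lang_score
        and doc.get("length", 0) >= min_length
    ]
```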

First Datasets, Then Models

The researchers have also released some initial MT models and large language models (LLMs), along with the training pipelines used to create them. “First dataset, then models!” they tweeted on March 1, 2024.

To date, they have trained MT models for 16 language pairs, but they aim to build translation models for all language pairs included in the first release of the HPLT parallel data. The first LLMs focused on Finnish and Norwegian. The team is currently training a multilingual Nordic model and has started training a family of massively multilingual European models on a dataset covering all official EU languages.

All models are openly published on Hugging Face and the HPLT project website, while the training code is available on the HPLT GitHub repository. “The idea is that a third party should be able to use this repository, together with our tool chain, to completely reproduce our model building,” they explained.

Environmental Impact

The researchers also underscored the significant expenses and environmental impact associated with creating large datasets for language modeling. By publicly releasing these datasets on open-source platforms, they aim to mitigate this impact by promoting reuse instead of starting from scratch.

Additionally, all models were trained on the LUMI supercomputer in Finland, currently the fastest in Europe, the fifth fastest globally, and the seventh greenest supercomputer in the world, powered entirely by renewable, carbon-neutral energy.

Looking ahead, they plan to expand language coverage, enrich the datasets with more metadata, and improve their tools to raise corpus quality, among other goals.

In closing, they also asked the community to join them in this effort by contributing both raw data sources and processed corpora. This collaborative effort will enrich the collection and benefit the research community as a whole.

Note: The High Performance Language Technologies (HPLT) project is a three-year EU-funded project that started in September 2022.

Authors: Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, and Jörg Tiedemann