Meta Releases Large Dataset for Multilingual Speech-to-Speech Translation


In early November 2022, Meta released SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations. The goal, according to Meta, is to make developing speech-to-speech translation (S2ST) systems easier.

SpeechMatrix was mined from real speech, namely European Parliament recordings. It contains speech alignments in 136 language pairs, with an average of 1,537 hours of source speech in each direction, for a total of more than 418,000 hours of speech.
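Mining parallel speech of this kind typically works by embedding utterances from both languages into a shared vector space and scoring candidate pairs with a margin-based similarity criterion. The sketch below illustrates that scoring idea in plain Python; the function names and the toy vectors are illustrative assumptions, not Meta's actual mining code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def margin_score(x, y, x_neighbors, y_neighbors, k=2):
    """Margin-based similarity (hypothetical sketch): the raw cosine
    score of a candidate pair is normalized by the average similarity
    of each embedding to its k nearest neighbors, so a pair only ranks
    highly if the two sides are closer to each other than to the rest
    of the corpus."""
    avg_x = sum(cosine(x, n) for n in x_neighbors[:k]) / k
    avg_y = sum(cosine(y, n) for n in y_neighbors[:k]) / k
    return cosine(x, y) / (0.5 * (avg_x + avg_y))
```

In a mining pipeline, candidate pairs whose margin score exceeds a tuned threshold would be kept as aligned speech segments.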

“To the best of our knowledge, SpeechMatrix is by far the largest freely available speech-to-speech translation corpus,” wrote the Meta researchers in their paper.

Data Scarcity

As mentioned in the Slator Interpreting Services and Technology Report, big tech companies and academia are driving rapid advancements in the area of speech-to-speech translation.

Speech-to-speech translation models can be indirect, translating via intermediate text and machine translation, or direct, building machine learning models on audio recordings of speech in the source and target languages.

Direct models are attracting more research interest and have several advantages. For instance, because they do not rely on any intermediate text, they can translate languages without a well-defined writing system. However, training such models faces a major obstacle: data scarcity.

As the researchers explained, “Human-labeled speech data is expensive to create, there are very few data resources providing parallel speech, and the data amount is quite limited.”

Mined Data Quality and Multilingual S2ST

To evaluate the quality of the mined data, the Meta researchers trained bilingual speech-to-speech translation models on SpeechMatrix data and reported on translation performance.

Enabled by the multilinguality of SpeechMatrix, they also explored multilingual speech-to-speech translation.

According to the same paper, “There are very few studies of multilingual speech-to-speech translation, partially due to the lack of multilingual speech-to-speech resources. With the massively multilingual data we have mined, we are able to explore multilingual S2ST training.”

The researchers found that strong S2ST models can be trained on the mined data, validating the quality of the speech alignments across languages.

In addition, they demonstrated that model pre-training, sparse scaling with Mixture-of-Experts (a machine learning technique that greatly increases a model's parameter count without a proportional increase in computation cost), and multilinguality can “bring large gains to translation performance.”
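The reason Mixture-of-Experts scales parameters without scaling compute is sparse routing: a gate scores every expert, but only the top-scoring expert (or a small subset) actually runs for a given input. The minimal sketch below shows top-1 routing with linear gate scores; it is an illustrative toy, not the routing used in Meta's models.

```python
def moe_forward(x, experts, gate_weights):
    """Sparse Mixture-of-Experts forward pass with top-1 routing
    (illustrative sketch). The gate computes a score per expert, but
    only the single best expert is evaluated, so adding more experts
    grows total parameters while per-input compute stays roughly flat.

    x            -- input vector (list of floats)
    experts      -- list of callables, one per expert
    gate_weights -- one weight row per expert for the linear gate
    """
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return experts[best](x)
```

Doubling the number of experts doubles the parameters stored in `experts` and `gate_weights`, but each forward pass still evaluates exactly one expert.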

The researchers hope this work can help others develop textless speech-to-speech translation systems for both written and unwritten languages.

Everything related to SpeechMatrix is open source and available for download from the GitHub repository.