Common though it may be, trawling the web for linguistic data has its drawbacks, from copyright issues to offensive speech (and even machine-translated content that is not labeled as such).
A group of researchers decided to sidestep these challenges and opted instead for Wikimedia Commons, whose publicly available collection of audio and transcription files reduces the chance of fallout.
The result is Speech Wikimedia, a multilingual dataset of 1,780 hours of licensed speech (195 GB) that includes 77 different languages and a range of scenarios and speakers. Each audio file has one or more transcripts, which may be in different languages, making the dataset “suitable” for training speech recognition, speech translation, and machine translation (MT) models. (The authors also found a collection of audio files without transcriptions, but have yet to explore it.)
Of course, the work behind the August 2023 paper introducing the dataset was not motivated by goodwill alone.
Collaborating authors hailed from a variety of institutions with a vested interest in furthering, and eventually commercializing, this research (much like social media and AI juggernaut Meta). Authors represented Factored.ai (Rafael Mosquera Gómez, Julian Eusse, Juan Ciro), NVIDIA (Daniel Galvez), Talon Voice (Ryan Hilleman), Long Now Foundation (Kurt Bollacker), and MLCommons (David Kanter).
Despite recent advancements in speech research, the paper stated that multilingual datasets are few and far between.
MT dataset OpenSubtitles offers 1,782 language pairings from movie subtitles in 62 languages. The usefulness of such a large-scale dataset is offset by the lack of a license for commercial use, a consequence of where the subtitles come from.
Speech Wikimedia’s long list of languages gives it an edge over Multilingual Librispeech and VoxPopuli, which contain data for just eight and 23 languages, respectively.
Mozilla Common Voice offers almost 10 times the amount of audio, with 17,690 hours of volunteers reading in 108 languages, but Speech Wikimedia covers much more diverse scenarios. The most common audio file content included current events, history, and “general non-fiction references.”
To create the dataset, researchers downloaded raw video and audio from Wikimedia Commons, an attractive source in the sense that its data is licensed or within the public domain, making it usable for (future) commercial research.
For ASR, the team found audio and transcripts with a common language, which accounted for 69% of the dataset. Researchers could identify transcript languages based on filenames, but they used Whisper’s language detection pipeline to identify the language of audio files where it was unknown.
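To illustrate how a transcript language might be read from a filename, here is a minimal Python sketch. The naming pattern it assumes (media name, then a language code, then `.srt`, as in `Talk.ogg.en.srt`) and the function name `transcript_language` are illustrative assumptions, not details confirmed by the paper.

```python
import re
from typing import Optional

# Assumed (hypothetical) naming pattern: "<media name>.<language code>.srt",
# for example "Talk.ogg.en.srt" or "Talk.ogg.pt-br.srt".
_TRANSCRIPT_RE = re.compile(
    r"^(?P<media>.+)\.(?P<lang>[a-z]{2,3}(?:-[a-z0-9]+)*)\.srt$",
    re.IGNORECASE,
)


def transcript_language(filename: str) -> Optional[str]:
    """Return the lowercased language code embedded in a transcript
    filename, or None if the name does not follow the assumed pattern."""
    match = _TRANSCRIPT_RE.match(filename)
    return match.group("lang").lower() if match else None
```

Under these assumptions, `transcript_language("Talk.ogg.en.srt")` yields the code `en`, while a bare media file such as `Talk.ogg` yields `None`. The audio side has no such hint in the name, which is where a language detection model comes in.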
The vast majority of audio files contained English (1,488 hours), with Dutch and German following after a sharp drop-off, at 22 and 12 hours of audio, respectively. Interestingly, some languages with far fewer speakers, such as Welsh and Basque, had more audio than typically high-resource languages, namely Arabic and Korean.
Moving on to speech translation, 31% of the dataset (or 628 hours) consisted of audio files paired with transcriptions in a language other than the original. Dominating the matches was English audio with Spanish transcription (67 hours), followed by English audio and transcripts in Arabic, French, Portuguese, Dutch, German, Italian, and Russian.
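The distinction between the ASR and speech translation subsets comes down to whether a transcript's language matches the audio's. A rough Python sketch of that rule follows; the function name `tasks_for` and the task labels are hypothetical, and the sketch lets one file count toward both tasks, which may differ from how the paper tallied its percentages.

```python
from typing import Iterable, List


def tasks_for(audio_lang: str, transcript_langs: Iterable[str]) -> List[str]:
    """Label which tasks an audio file could serve, given its spoken
    language and the languages of its transcripts."""
    langs = set(transcript_langs)
    tasks = []
    # A transcript in the spoken language makes an ASR training pair.
    if audio_lang in langs:
        tasks.append("asr")
    # Any transcript in another language makes a speech translation pair.
    if langs - {audio_lang}:
        tasks.append("speech_translation")
    return tasks
```

For instance, English audio with only a Spanish transcript would be labeled for speech translation alone, while English audio with both English and Spanish transcripts would carry both labels.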
Latin and Welsh audio, with their corresponding English transcripts, made appearances in the top 20 language combinations, with 11 and eight hours, respectively. The only non-English language pair to break into the top 20 was Dutch-Russian, with just five hours of audio.
Unlike ASR and speech translation, MT focuses exclusively on text translation. Researchers found multiple transcriptions associated with a single audio file — almost 11% of the audio files were accompanied by transcriptions in at least three different languages.
Except for Arabic, language pairings for MT consisted solely of European languages. English-Spanish topped the list (text for 135 hours of audio), followed by English-French (85 hours) and English-Portuguese (57 hours). The highest-volume non-English pairs were Spanish-Portuguese (54 hours) and Spanish-French (51 hours).
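The MT figures are expressed as hours of audio whose transcripts exist in both languages of a pair. One way to compute such totals is sketched below in Python; the record layout (a duration in hours plus a list of transcript languages) and the name `mt_pair_hours` are assumptions made for illustration, not the authors' pipeline.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def mt_pair_hours(records: Iterable[Tuple[float, List[str]]]) -> Counter:
    """Sum audio hours for every unordered pair of transcript languages.

    Each record is (duration_hours, transcript_languages). A file with
    transcripts in n languages contributes to n * (n - 1) / 2 pairs.
    """
    hours: Counter = Counter()
    for duration_hours, langs in records:
        # Sort so that ("en", "es") and ("es", "en") share one key.
        for pair in combinations(sorted(set(langs)), 2):
            hours[pair] += duration_hours
    return hours
```

Calling `mt_pair_hours(records).most_common(5)` would then list the five highest-volume pairs, which is the kind of ranking the figures above describe.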
The dataset was designed for future use in training models, but it is not quite ready: the raw data is publicly available on Hugging Face but still needs to be processed. Looking ahead, researchers noted that video data, removed for the purposes of this dataset, could be helpful for future multimodal task research (think Meta’s SeamlessM4T).