Top streaming platform Netflix is now publicizing its research on Speech and Music Activity Detection — with a particular focus on “its use in localization and dubbing.”
Speech and Music Activity Detection (SMAD) refers to the process, and the technology, of detecting where speech and where music occur in an audio file, tracking the two separately at the level of individual frames.
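In practice, that frame-level output is often represented as two independent binary streams, one for speech and one for music, so a single frame can carry both labels at once. Here is a minimal sketch of that representation in Python (the 10 Hz frame rate and the activity spans are illustrative assumptions, not Netflix's actual configuration):

```python
import numpy as np

# Hypothetical SMAD output for a 5-second clip at 10 frames per second.
# Speech and music are *independent* binary streams, so a frame can be
# labeled as both (e.g., dialogue over a score) or neither.
FRAMES_PER_SECOND = 10
num_frames = 5 * FRAMES_PER_SECOND

speech = np.zeros(num_frames, dtype=np.int8)
music = np.zeros(num_frames, dtype=np.int8)

speech[5:35] = 1   # speech active from 0.5 s to 3.5 s
music[20:50] = 1   # music active from 2.0 s to 5.0 s

overlap = (speech & music).sum() / FRAMES_PER_SECOND
print(f"Speech and music overlap for {overlap:.1f} seconds")
```

Keeping the two streams independent is what lets a detector handle the common case in film and TV where dialogue plays over a score.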
The practical uses of SMAD include preparation for some of the most common localization tasks in the media and entertainment space.
Classifying and segmenting long-form audio for large datasets can be useful for translation and dub-script generation; dialogue analysis and processing may be a prerequisite for spoken-language identification and speech transcription.
Music information retrieval, while seemingly further afield, can also apply to song-lyric transcription, as musical passages with lyrics are often translated and included in subtitles (as well as in closed captions).
Helpful though SMAD may be for localization, Iroro Orife, Chih-Wei Wu, and Yun-Ning (Amy) Hung — the authors of a November 13, 2023 Netflix Technology Blog post — pointed out that labeling speech and music activity at the audio-frame level, and at scale, is expensive and labor-intensive. Moreover, copyright limitations often prevent audio content from being shared publicly.
Several datasets are available publicly, but they are less than ideal, according to the paper Netflix’s researchers published in EURASIP Journal on Audio, Speech, and Music Processing in September 2022.
Two large datasets contain labels that “can be used only for either speech or music detection, but not both,” the authors wrote, which is an issue for series and films, in which music and dialogue often coincide. Similarly, several other datasets deal only with short segments and can classify audio segments as speech, music, or noise — again, no overlap.
So Netflix decided to create its own large-scale dataset instead — using the company’s own extensive catalog of TV series and films.
Why Is This Dataset Different from All Other Datasets?
“We show how leveraging a large-scale dataset with noisy labels can improve SMAD results. The presented TV Speech and Music (TVSM) dataset is derived from around 1600 h of professionally recorded and produced audio for TV shows,” the authors wrote. “The noisy labels are derived from different sources such as subtitles, scripted musical cue sheets, or pre-trained model’s predictions.”
That is correct: Researchers turned to subtitles (in part) to inform their speech labels. Subtitle timestamps are considered a reliable source of the approximate start- and end-times of speech utterances, and typically include lyrics from singing voices. (Closed captions, on the other hand, were not used because they contain all audio information, such as background noise, and not just speech.)
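To illustrate how subtitle timing might be mapped onto frame-level labels, here is a hedged sketch in Python. The SRT timestamp format is standard, but the 10 Hz frame grid and the helper functions are assumptions made for illustration, not the pipeline Netflix describes:

```python
import re

# Hypothetical example: turn SRT subtitle timestamps into noisy
# frame-level speech labels. Assumes a 10 Hz frame grid; the real
# label-generation pipeline is described only at a high level.
FRAMES_PER_SECOND = 10

SRT_TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def to_seconds(timestamp: str) -> float:
    h, m, s, ms = map(int, SRT_TIME.match(timestamp).groups())
    return h * 3600 + m * 60 + s + ms / 1000

def speech_frames(subtitle_spans, duration_s):
    labels = [0] * int(duration_s * FRAMES_PER_SECOND)
    for start, end in subtitle_spans:
        a = int(to_seconds(start) * FRAMES_PER_SECOND)
        b = int(to_seconds(end) * FRAMES_PER_SECOND)
        for i in range(a, min(b, len(labels))):
            labels[i] = 1
    return labels

# One subtitle cue from 2.0 s to 4.5 s in a 10-second clip.
labels = speech_frames([("00:00:02,000", "00:00:04,500")], duration_s=10)
print(sum(labels), "of", len(labels), "frames marked as speech")
```

The labels such a mapping produces are "noisy" in exactly the sense the authors describe: subtitle cues only approximate the true start and end of each utterance.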
The 1,608 hours of professionally recorded and produced audio came from “large and diverse” content across genres, all published between 2016 and 2019.
Sixty percent of the content originated in the United States. The dataset also included three different languages: English accounted for 77%; Spanish, 20%; and Japanese, the remaining 3%.
The team trained its 832,000-parameter model on subsets of 20-second segments sampled at random from the audio files. Two training subsets had noisy labels (generated automatically from the subtitles), and a third subset had clean, manually created annotations.
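A rough sketch of that sampling scheme in Python; the 20-second window comes from the paper, while the sample rate, padding behavior, and random-generator setup are illustrative assumptions:

```python
import numpy as np

# Illustrative sketch of drawing random fixed-length training segments
# from long-form audio. The 20-second window matches the paper; the
# 16 kHz sample rate is an assumption for illustration.
SAMPLE_RATE = 16_000
SEGMENT_SECONDS = 20
SEGMENT_SAMPLES = SAMPLE_RATE * SEGMENT_SECONDS

rng = np.random.default_rng(seed=0)

def random_segment(audio: np.ndarray) -> np.ndarray:
    """Return one randomly positioned 20-second window from a long track."""
    if len(audio) <= SEGMENT_SAMPLES:
        # Pad short files out to the full segment length.
        return np.pad(audio, (0, SEGMENT_SAMPLES - len(audio)))
    start = rng.integers(0, len(audio) - SEGMENT_SAMPLES)
    return audio[start:start + SEGMENT_SAMPLES]

# Example: a fake 10-minute mono track.
track = rng.standard_normal(SAMPLE_RATE * 600).astype(np.float32)
batch = np.stack([random_segment(track) for _ in range(8)])
print(batch.shape)  # (8, 320000)
```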
They then tested the model’s performance on four open datasets containing audio data from television programs, YouTube clips, and miscellaneous content, including concerts, radio broadcasts, and “low-fidelity folk music.”
“Compared to two third-party methods trained with synthetic and small-scale data, our proposed benchmark methods were able to generalize better and outperform state-of-the-art results on several existing datasets, in spite of training on noisy labels,” the authors concluded, adding the caveat that “the quality of the labels is still crucial for further improvements.”
The blog post acknowledges that Netflix's interest in this area of research lies in the potentially substantial productivity gains for localization teams across the world working in dozens of different languages.
Netflix has made its audio features and labels available on Zenodo, and linked to a GitHub repository with multiple audio tools, such as Python code and pre-trained models.
Interestingly, however, Netflix has stuck with the entertainment-industry status quo by not open-sourcing the TVSM dataset itself — perhaps due to the same copyright issues that sparked its creation in the first place.