UC Berkeley Open Sources Nearly 500k English and Japanese Clips to Train AI Dubbing


In a paper published on January 10, 2024, UC Berkeley researchers Kevin Cai, Chonghua Liu, and David M. Chan introduced the Anim-400K dataset, a comprehensive resource for research and development in automated dubbing.

Anim-400K is a large-scale dataset containing aligned audio-video clips in English and Japanese. It comprises over 425,000 aligned clips, totaling 763 hours, sourced from more than 190 properties across various themes and genres. 

Anim-400K is also enriched with metadata at several levels, such as property, episode, and clip. This metadata includes details like genres, themes, show-level ratings, character profiles, animation styles, episode synopses, episode ratings, and subtitles.
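To give a sense of how such multi-level metadata might be organized, here is a minimal sketch of hypothetical property-, episode-, and clip-level records. The field names and structure are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema sketch; field names are assumptions, not Anim-400K's actual format.

@dataclass
class ClipMeta:
    clip_id: str
    start_sec: float          # offset within the episode
    end_sec: float
    ja_audio_path: str        # aligned Japanese audio segment
    en_audio_path: str        # aligned English (dubbed) audio segment
    en_subtitle: str          # English subtitle for the Japanese track

@dataclass
class EpisodeMeta:
    episode_id: str
    synopsis: str
    rating: float
    clips: List[ClipMeta] = field(default_factory=list)

@dataclass
class PropertyMeta:
    property_id: str
    genres: List[str]
    themes: List[str]
    show_rating: float
    animation_style: str
    characters: List[str]
    episodes: List[EpisodeMeta] = field(default_factory=list)
```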

Due to its rich metadata, Anim-400K can not only support automated dubbing but also facilitate additional tasks, including video summarization, character identification and description, genre/theme/style classification, video quality analysis, and simultaneous translation.

The Need for a New Dataset

Despite progress in automated subtitling through advancements in automatic speech recognition (ASR) and machine translation (MT), dubbing translation remains at a relatively early stage on the automation curve.

Current automated dubbing systems that rely on complex pipelines — combining ASR, MT, and text-to-speech (TTS) systems — cannot capture the nuances required for effective dubbing, such as precise timing, synchronization with facial movements, and prosody matching.
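As a rough illustration of why a cascaded pipeline loses these nuances, the sketch below chains placeholder ASR, MT, and TTS components. The interfaces are assumptions made for illustration, not a reference to any specific system.

```python
from typing import Protocol

# Placeholder interfaces; real systems would wrap actual ASR/MT/TTS models.

class ASR(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class MT(Protocol):
    def translate(self, text: str, src: str, tgt: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

def cascaded_dub(audio: bytes, asr: ASR, mt: MT, tts: TTS) -> bytes:
    """Naive cascaded dubbing: transcribe -> translate -> synthesize.

    Note what is lost at each hand-off: timing, prosody, and any cue
    needed for lip synchronization never reach the TTS stage.
    """
    transcript = asr.transcribe(audio)                         # source-language text only
    translation = mt.translate(transcript, src="ja", tgt="en")  # no duration constraints
    return tts.synthesize(translation)                          # prosody chosen independently
```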

While end-to-end dubbing is a potential solution, a lack of data hinders its development and limits the quality of end-to-end dubbing models. The authors noted that the Heroes corpus is the primary source of training and testing data, but its size (7K samples) is inadequate for training deep neural networks.

Instead, researchers often turn to privately collected datasets or to simultaneous translation (ST) datasets like MuST-C and MuST-Cinema. However, while rich in source audio, these ST datasets lack the target-language audio needed to evaluate crucial qualities like prosody, lip-matching, and timing in the spoken translation.

“It is clear that a new large-scale dataset is required to fill the training gap between ST datasets and high-quality manually aligned datasets such as the Heroes and IWSLT corpuses,” said the authors.

A Strong Complement to Latin-Based Datasets

To fill this gap, they introduced Anim-400K, “a large-scale fully aligned dataset of audio segments containing true dubbed audio distributions.”

“Anim-400K is a relatively large dataset on a non-latin based language, making it a strong complement to any latin-based dataset,” said the authors.

Sourced by scraping publicly available dubbed anime videos from popular anime-watching websites, Anim-400K includes raw episodes with both Japanese and English audio tracks, complemented by English subtitles for the Japanese track. Unlike previous efforts, which rely on a bottom-up approach, the researchers used a top-down approach to extract aligned segments, ensuring better alignment and capturing unique performance content.

As the authors explained, a bottom-up approach analyzes individual words and segments using resources such as movie scripts and subtitles. While this method aligns segments based on the available textual information, it does not guarantee complete alignment: segments may match the audio well yet not be fully synchronized.

In contrast, a top-down approach starts from a higher level and ensures that all segments are always aligned, even in the presence of noise (ASR noise, speaker noise). This approach has the added benefit (or drawback) of enabling the model to capture unique performance content not found in transcripts, including non-speech utterances.
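A minimal sketch of the top-down idea, assuming two same-length audio tracks of the same episode at the same sample rate: segments are cut at shared time boundaries on the episode timeline, so the Japanese and English segments are aligned by construction regardless of what the transcripts say. The boundary-picking logic here is a placeholder, not the authors' method.

```python
import numpy as np

def cut_aligned_segments(ja_track: np.ndarray,
                         en_track: np.ndarray,
                         boundaries_sec: list[float],
                         sample_rate: int = 16_000):
    """Cut both tracks at the same time boundaries.

    Because cut points are defined on the episode timeline rather than on
    transcript matches, each (Japanese, English) pair covers the same span
    and is aligned by construction -- noise, laughter, and other non-speech
    utterances included.
    """
    segments = []
    points = [0.0] + sorted(boundaries_sec) + [len(ja_track) / sample_rate]
    for start, end in zip(points[:-1], points[1:]):
        lo, hi = int(start * sample_rate), int(end * sample_rate)
        segments.append((ja_track[lo:hi], en_track[lo:hi]))
    return segments
```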

The authors concluded that Anim-400K “holds great promise for improving accessibility and engagement.” The dataset is publicly available for research purposes on GitHub.