Here’s a New Dataset for Emotion-Aware Speech Translation

In a May 21, 2024 paper, researchers from the Technical University of Munich, Kyoto University, the AI software company SenseTime, and the National Institute of Informatics in Japan introduced MELD-ST, a new dataset created to improve speech-to-text translation (S2TT) and speech-to-speech translation (S2ST) by integrating emotional context into the translation process.

The researchers emphasized that emotion plays a “crucial role” in human conversation and that accurately conveying emotions in translation is essential to preserving the intended intensity and sentiment. They cited the phrase “Oh my God!” as an example: depending on whether it expresses surprise, shock, or excitement, it must be translated differently to make sense in another culture.

Previous studies have explored emotion-aware translation mainly in text-to-text translation (T2TT), with little focus on emotion in speech translation. With MELD-ST, the authors aim to fill this gap in the field of speech translation, where emotional nuances often go unaddressed.

MELD-ST is built on the existing MELD (Multimodal EmotionLines Dataset) dataset, which features emotionally rich dialogues, by adding corresponding speech data from the TV series “Friends.” It includes audio and subtitles in English-to-Japanese and English-to-German language pairs, each with 10,000 utterances annotated with emotion labels.

According to the researchers, the MELD-ST dataset differs from other datasets because: (1) it includes emotion labels for each utterance, making it valuable for experiments and analyses, and (2) it features acted speech in an emotionally rich environment, making it suitable for initial studies on emotion-aware speech translation.

To test MELD-ST, the researchers used the SEAMLESSM4T model for S2TT and S2ST experiments under different conditions: without fine-tuning, fine-tuning without emotion labels, and fine-tuning with emotion labels. They evaluated the performance using BLEURT scores for S2TT and ASR-BLEU for S2ST, along with other metrics like prosody, voice similarity, pauses, and speech rate.
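The article does not spell out how the emotion labels were fed to the model during fine-tuning. One common approach is to prepend the label as a tag to the input text so the model can condition on it; the sketch below illustrates that idea. The `<emo:...>` tag format and the `add_emotion_tag` helper are hypothetical, not the paper's exact scheme, though the seven emotion categories are those used in MELD.

```python
# Hedged sketch: conditioning fine-tuning data on emotion labels by
# prepending a tag to each utterance. The "<emo:...>" tag format is an
# illustration only; the paper may use a different mechanism.

# The seven emotion categories annotated in MELD.
EMOTIONS = {"neutral", "joy", "surprise", "anger", "sadness", "fear", "disgust"}

def add_emotion_tag(text: str, emotion: str) -> str:
    """Prefix an utterance with its emotion label so a translation model
    can condition on the emotion during fine-tuning."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion label: {emotion}")
    return f"<emo:{emotion}> {text}"

# The same surface form receives different conditioning signals:
print(add_emotion_tag("Oh my God!", "surprise"))  # <emo:surprise> Oh my God!
print(add_emotion_tag("Oh my God!", "anger"))     # <emo:anger> Oh my God!
```

Tagging the input this way lets a single model learn emotion-dependent translations without architectural changes, which is why it is a common baseline for label conditioning.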

They found that incorporating emotion labels can improve translation performance in some settings, particularly for S2TT tasks, where slight improvements were observed. “We can see that the quality of the translations generally improved after fine-tuning, and incorporating emotion labels led to slight enhancements,” they said.

Fine-Tuning with Emotion Labels Doesn’t Help S2ST

However, for S2ST tasks, fine-tuning with emotion labels did not significantly improve results. “We can see that fine-tuning the SEAMLESSM4T model improves the ASR-BLEU results. However, fine-tuning with emotion labels does not help,” they said.

The researchers acknowledged several limitations and noted that future research is needed to address these limitations and further develop emotion-aware speech translation systems.

For future work, they propose training multitask models that integrate speech emotion recognition with translation, using dialogue context to improve performance, and refining the dataset to include more natural speech settings.

The MELD-ST dataset is available on Hugging Face and is intended for research purposes only.

Authors: Sirou Chen, Sakiko Yahata, Shuichiro Shimizu, Zhengdong Yang, Yihang Li, Chenhui Chu, Sadao Kurohashi