On October 14, 2023, researchers at Microsoft Cloud and AI, Microsoft Research Asia, and Shanghai Jiao Tong University published updated results for the capabilities of ComSL (Composite Speech-Language Model), a speech-language model originally introduced in a paper in May 2023.
According to the researchers, the ComSL model is based on public pretrained speech-only (audio data) and language-only (text data) models and has been optimized for spoken language tasks by integrating both modalities into its training.
The main differentiator of the ComSL model, the researchers explained, is that it outperforms “end-to-end modeling,” the most widely used training methodology thus far. End-to-end modeling uses audio and text data separately, even though, the researchers say, the two “may not be optimal for each other.”
In the composite model, the researchers achieved simpler cross-modality learning based on speech-text mapping/matching. This training improves model performance and does not require any force-aligned speech and text.
For their methodology, the researchers applied machine translation (MT) and automatic speech recognition (ASR) as what they call “auxiliary tasks” in a multi-task learning mode during the optimization of the end-to-end speech translation (ST) model.
Multi-task learning (MTL) mode involves “sharing common knowledge among different tasks” so that the MT task can guide the ST task. However, the researchers stated that, because of the mismatch between the speech and text modalities, this guidance was less effective than it could be.
The ComSL model was trained starting from existing, fine-tuned models that take speech-only and text-only input, with ST, ASR, and MT as tasks and a “cross-modality learning (CML)” approach based on paired speech-text input instead of forced alignment.
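To make the multi-task setup more concrete, the sketch below shows one common way to combine several task losses into a single training objective: a weighted sum. The article does not say how ComSL combines its ST, ASR, MT, and CML losses, so the weighted sum, the `TaskLosses` container, the `combined_loss` function, and the weights are all illustrative assumptions rather than the researchers' implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TaskLosses:
    """Per-task loss values for one training batch (hypothetical container)."""
    st: float   # speech translation
    asr: float  # automatic speech recognition
    mt: float   # machine translation
    cml: float  # cross-modality learning


# Assumed equal weights; these are NOT values reported by the researchers.
DEFAULT_WEIGHTS: Dict[str, float] = {"st": 1.0, "asr": 1.0, "mt": 1.0, "cml": 1.0}


def combined_loss(losses: TaskLosses,
                  weights: Optional[Dict[str, float]] = None) -> float:
    """Return the weighted sum of the four task losses."""
    w = DEFAULT_WEIGHTS if weights is None else weights
    return (w["st"] * losses.st
            + w["asr"] * losses.asr
            + w["mt"] * losses.mt
            + w["cml"] * losses.cml)


if __name__ == "__main__":
    batch = TaskLosses(st=2.0, asr=1.0, mt=0.5, cml=0.25)
    print(combined_loss(batch))
```

In a setup like this, lowering the weight of an auxiliary task reduces how strongly it steers the shared parameters relative to the main ST task.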
The training steps consisted of fine-tuning the language model (with all the paired text data), multi-task learning (the tasks were ST, MT, ASR, and CML), regularization on the MT output (fine-tuning with MT tasks), and freezing the speech encoder (retaining speech representations at the start of fine-tuning).
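The encoder-freezing step can be pictured as a simple schedule: for an initial number of updates the speech encoder receives no gradient updates, and afterwards it trains along with everything else. The following is a minimal sketch under that assumption. The component names, the `trainable_components` function, and the `freeze_steps` threshold are hypothetical and do not come from the paper.

```python
from typing import Dict


def trainable_components(step: int, freeze_steps: int) -> Dict[str, bool]:
    """Report which (hypothetical) model components are updated at a given step.

    The speech encoder stays frozen for the first `freeze_steps` updates so
    that its pretrained speech representations are retained early on.
    """
    if step < 0 or freeze_steps < 0:
        raise ValueError("step and freeze_steps must be non-negative")
    return {
        "speech_encoder": step >= freeze_steps,  # frozen at the start
        "language_model": True,                  # assumed always trainable
    }


if __name__ == "__main__":
    for step in (0, 999, 1000):
        print(step, trainable_components(step, freeze_steps=1000))
```

In a deep learning framework, the returned flags would typically be used to switch gradient computation on or off for the matching parameter groups.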
400 hours of English
The experiments in this study used the CoVoST 2 dataset, which comprises translations from 21 languages into English and from English into 15 languages. It contains approximately 400 hours of English recordings and 900 hours of recordings from 21 additional languages.
The researchers focused mainly on speech translation from non-English languages into English, measuring performance with BLEU scores on the CoVoST 2 test set. The baseline models were Whisper and mBART-50, themselves fine-tuned on CoVoST 2.
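For readers unfamiliar with BLEU, the sketch below shows the core of the metric: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is a simplified illustration with whitespace tokenization, a single reference per sentence, and no smoothing. It is not the scoring tool used by the researchers, and its scores will not match standard implementations exactly.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Simplified corpus-level BLEU on a 0-100 scale (one reference per sentence)."""
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes output shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))
```

A hypothesis that is correct but too short is penalized only through the brevity term, which is why both precision and length matter when comparing systems.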
The composite model was found to outperform the base speech model (Whisper) and the combination of speech and language models (Whisper+mBART). The incorporation of ST data contributed to a high score on the CoVoST 2 test set, and the composite model was also evaluated on speech-to-text translation tasks, with better results than those reported for end-to-end modeling that includes the same ST, ASR, and MT tasks.