Chinese technology company Alibaba has released a large-scale audio-language model, Qwen-Audio, that handles more than 30 distinct audio tasks — including multilingual automatic speech recognition (ASR) and translation.
According to a November 2023 paper by Alibaba researchers Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou, earlier audio-language models support “a limited range of interaction capabilities,” and the obvious remedy, directly co-training a single model on all tasks and datasets, can cause interference issues.
Qwen-Audio’s multitask training framework, by contrast, uses a set of hierarchical tags to encourage knowledge-sharing while avoiding interference. “Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts,” the authors concluded.
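For readers curious how hierarchical tags might work in practice, the Python sketch below conditions each training example on an ordered sequence of tags that runs from general to specific. It is an illustration only: the `TaskSpec` class, the helper functions, and the tag strings are hypothetical and are not taken from the paper or from Alibaba's code. The idea is that related tasks share their leading tags (encouraging shared knowledge) while the later tags tell the model which output is expected (limiting interference).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical description of the task attached to one audio clip."""
    task: str                          # e.g. "transcribe", "translate", "caption"
    audio_lang: Optional[str] = None   # language spoken in the clip, if any
    text_lang: Optional[str] = None    # language of the expected text output
    timestamps: bool = False           # whether word-level timing is requested


def build_tags(spec: TaskSpec) -> List[str]:
    """Return tags ordered from most general to most specific.

    The tag strings are made up for this sketch.
    """
    return [
        "<|audio|>",
        f"<|audio_lang:{spec.audio_lang or 'none'}|>",
        f"<|task:{spec.task}|>",
        f"<|text_lang:{spec.text_lang or 'none'}|>",
        "<|timestamps|>" if spec.timestamps else "<|notimestamps|>",
    ]


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Count how many leading tags two tasks have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# English speech recognition vs. English-to-Chinese speech translation.
asr = build_tags(TaskSpec("transcribe", audio_lang="en", text_lang="en"))
st = build_tags(TaskSpec("translate", audio_lang="en", text_lang="zh"))

print("".join(asr))
print("".join(st))
print(shared_prefix_len(asr, st))  # the two tasks share their first 2 tags
```

In this toy setup, recognition and translation of English speech share the audio and source-language tags and diverge only at the task tag, which is the kind of structure that could let a single decoder learn from both without confusing their outputs.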
AssemblyAI Developer Educator Marco Ramponi explained in a December 7, 2023, blog post that, compared to text, audio data is more difficult for large language models (LLMs) to process because it tends to be more information-dense, carrying cues such as speaker emotion through tone, pace, emphasis, and loudness.
In Ramponi’s view, these challenges make Qwen-Audio’s progress toward so-called “universal audio understanding” — that is, an AI system that can interpret and “make sense” of audio input for downstream tasks such as speech translation and speech editing — all the more impressive.
Alibaba exposed Qwen-Audio to different linguistic styles and acoustic qualities, with a wide spectrum of audio including human speech, natural sounds, instrumental music, and songs with lyrics. Qwen-Audio builds on the open-source Qwen-7B language model, which includes a 32-layer Transformer decoder with 7.7bn parameters.
Qwen-Audio reportedly supports eight languages. Its speech-to-text translation was evaluated on the CoVoST2 dataset, where it outperformed baseline models “across all seven translation directions.” The only language pair the paper mentions by name, however, is Mandarin-English.
Regardless, the team is optimistic about Qwen-Audio’s capabilities, having already used Qwen-Audio as a basis for “Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios.”
Alibaba has plenty of company in the speech translation sphere, where competitors Google and OpenAI are already vying for dominance via Gemini and Whisper, respectively. Amazon and Meta have been similarly active throughout 2023.