Instruction-following audio-language models have recently attracted broad attention as a means of audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field, and most existing work consequently supports only a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and a variety of audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding. Directly co-training on all tasks and datasets, however, can lead to interference, because the textual labels associated with different datasets vary considerably in task focus, language, annotation granularity, and text structure. To overcome this one-to-many interference, we design a multi-task training framework that conditions the decoder on a sequence of hierarchical tags: shared tags encourage knowledge sharing, while specific tags avoid interference. Qwen-Audio achieves strong performance across diverse benchmark tasks without any task-specific fine-tuning, surpassing its counterparts. Building on Qwen-Audio, we further develop Qwen-Audio-Chat, which accepts diverse audio and text inputs, enabling multi-turn dialogue and supporting a range of audio-centric scenarios.
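To make the idea of conditioning the decoder on hierarchical tags more concrete, the following is a minimal sketch, not the paper's implementation. The tag strings, the `TaskSpec` fields, and the helper names are all hypothetical and chosen only for illustration; the abstract does not specify the actual tag vocabulary or ordering.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical description of one training example's task."""
    task: str         # e.g. "transcribe" or "caption" (illustrative names)
    audio_lang: str   # language spoken in the audio, or "unknown"
    text_lang: str    # language of the target text
    timestamps: bool  # whether the target text carries timestamps


def build_tag_prefix(spec: TaskSpec) -> list[str]:
    """Build the tag sequence that would be prepended to the decoder input.

    Tags go from general to specific. Tags that are identical across
    tasks can be shared; the task tag separates the output formats.
    All tag strings here are assumptions, not the paper's vocabulary.
    """
    return [
        "<|start|>",
        f"<|audio_lang:{spec.audio_lang}|>",
        f"<|task:{spec.task}|>",
        f"<|text_lang:{spec.text_lang}|>",
        "<|timestamps|>" if spec.timestamps else "<|notimestamps|>",
    ]


def shared_tags(a: list[str], b: list[str]) -> list[str]:
    """Return the tags of `a` that also appear in `b`, in order."""
    return [tag for tag in a if tag in b]


asr = build_tag_prefix(TaskSpec("transcribe", "en", "en", False))
caption = build_tag_prefix(TaskSpec("caption", "unknown", "en", False))
common = shared_tags(asr, caption)
```

In this sketch, the two tasks share the start, text-language, and timestamp tags, while the task and audio-language tags differ, which is one way shared and specific tags could coexist in a single conditioning sequence.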