Audio-text pre-training (ATP) has made remarkable strides across a variety of downstream tasks. Yet most existing pretrained audio models specialize in either discriminative or generative tasks, not both. In this study, we develop SLIT, a novel ATP framework that transfers flexibly to both audio-text understanding and generation tasks by bootstrapping audio-text pre-training from frozen pretrained audio encoders and large language models. To bridge the modality gap during pre-training, we leverage Q-Former, which undergoes a multi-stage pre-training process. The first stage enhances audio-text representation learning from a frozen audio encoder, while the second stage boosts audio-to-text generative learning with a frozen language model. Furthermore, we introduce an ATP instruction tuning strategy, which enables flexible and informative feature extraction tailored to the given instructions for different tasks. Experiments show that SLIT achieves superior performance on a variety of audio-text understanding and generation tasks, and demonstrates strong generalization capabilities when applied directly to zero-shot scenarios.
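As a rough illustration of the bridging idea described in the abstract, the following minimal PyTorch sketch places a small trainable module between a frozen audio encoder and a frozen language model. It is not the SLIT implementation: the class name `QueryBridge`, the `freeze` helper, all dimensions, the single cross-attention layer, and the choice to prepend the query outputs to the text embeddings as a soft prompt are assumptions made for illustration. The frozen encoder and language model are replaced by tiny stand-in layers, and the instruction tuning strategy is not modeled.

```python
import torch
import torch.nn as nn


class QueryBridge(nn.Module):
    """Hypothetical Q-Former-style bridge (illustrative, not the paper's module).

    A fixed set of learnable query vectors cross-attends to audio features,
    and the result is projected into the language model's embedding space.
    """

    def __init__(self, num_queries=8, audio_dim=64, hidden_dim=32,
                 llm_dim=48, num_heads=4):
        super().__init__()
        # Learnable queries, shared across the batch.
        self.queries = nn.Parameter(torch.randn(1, num_queries, hidden_dim) * 0.02)
        # Queries attend to audio features (keys/values have a different width).
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=hidden_dim, num_heads=num_heads,
            kdim=audio_dim, vdim=audio_dim, batch_first=True,
        )
        self.norm = nn.LayerNorm(hidden_dim)
        self.to_llm = nn.Linear(hidden_dim, llm_dim)

    def forward(self, audio_feats):
        # audio_feats: (batch, time, audio_dim)
        q = self.queries.expand(audio_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, audio_feats, audio_feats)
        # Residual connection, normalization, then projection to the LLM width.
        return self.to_llm(self.norm(q + attended))


def freeze(module):
    """Disable gradients for every parameter and switch to eval mode."""
    for p in module.parameters():
        p.requires_grad_(False)
    return module.eval()


torch.manual_seed(0)

# Stand-ins (assumptions): a real system would load pretrained models here.
audio_encoder = freeze(nn.Linear(20, 64))    # placeholder for a frozen audio encoder
llm_embed = freeze(nn.Embedding(100, 48))    # placeholder for a frozen LLM's input embeddings
bridge = QueryBridge()                       # the only trainable component

frames = torch.randn(2, 50, 20)              # (batch, time, raw feature dim)
with torch.no_grad():
    audio_feats = audio_encoder(frames)      # (2, 50, 64)

soft_prompt = bridge(audio_feats)            # (2, 8, 48)
text_ids = torch.randint(0, 100, (2, 5))     # dummy token ids
# Prepend the audio-derived queries to the text embeddings.
llm_inputs = torch.cat([soft_prompt, llm_embed(text_ids)], dim=1)  # (2, 13, 48)
```

In this sketch only the parameters of `bridge` receive gradients, which mirrors the abstract's premise that both the audio encoder and the language model stay frozen while the bridging module is trained.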