Audio-text pre-training (ATP) has made remarkable strides across a variety of downstream tasks. Yet most existing pre-trained audio models specialize in either discriminative or generative tasks, not both. In this study, we develop SLIT, a novel ATP framework that transfers flexibly to both audio-text understanding and generation tasks by bootstrapping audio-text pre-training from frozen pre-trained audio encoders and large language models. To bridge the modality gap during pre-training, we leverage a Q-Former, which undergoes a multi-stage pre-training process. The first stage enhances audio-text representation learning from a frozen audio encoder, while the second stage boosts audio-to-text generative learning with a frozen language model. Furthermore, we introduce an ATP instruction tuning strategy, which enables flexible and informative feature extraction tailored to the instructions given for different tasks. Experiments show that SLIT achieves superior performance on a variety of audio-text understanding and generation tasks and demonstrates strong generalization when applied directly to zero-shot scenarios.