Cross-modal alignment is a central challenge in audio-language pre-training (ALP). Moreover, integrating audio inputs with diverse distributions and task variations complicates the development of generic audio-language models. In this study, we introduce MINT, a novel ALP framework that boosts audio-language models through multi-target pre-training and instruction tuning. MINT leverages frozen pre-trained audio encoders and large language models (LLMs) to improve audio-language pre-training, enabling effective transferability to both audio-text understanding and generation tasks. To address the modality gap, we propose Bridge-Net, a lightweight trainable module that enhances cross-modal alignment and the model's ability to follow instructions across a variety of audio-text tasks. Bridge-Net is pivotal within MINT: it first strengthens audio-language representation learning through multi-target pre-training, and then boosts audio-to-language generative learning by integrating a frozen LLM with instruction tuning. This integration enables MINT to extract features flexibly and effectively, tailored to the instructions provided for diverse tasks. Experimental results demonstrate that MINT attains superior performance across various audio-language understanding and generation tasks, highlighting its robust generalization capabilities even in zero-shot scenarios.
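The overall design described above, where a lightweight trainable module bridges a frozen audio encoder and a frozen LLM, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all components are toy numpy stand-ins, and the function and variable names (`audio_encoder`, `bridge_net`, `frozen_llm`, dimensions) are assumptions chosen for illustration.

```python
# Illustrative sketch (assumed structure, not the paper's code): a MINT-style
# pipeline in which only the Bridge-Net parameters would be trained, while the
# audio encoder and language model stay frozen.
import numpy as np

rng = np.random.default_rng(0)

# Frozen pre-trained audio encoder: a fixed projection standing in for real weights.
W_enc = rng.standard_normal((64, 128))

def audio_encoder(audio_feats):
    # (T, 64) raw audio features -> (T, 128) audio embeddings
    return np.tanh(audio_feats @ W_enc)

# Trainable Bridge-Net: the only parameters that would be updated in pre-training.
W_bridge = rng.standard_normal((128, 256)) * 0.01

def bridge_net(audio_emb):
    # (T, 128) -> (T, 256): maps audio embeddings into the LLM's input space
    return audio_emb @ W_bridge

# Frozen LLM stub: consumes the bridged embeddings as a soft prefix.
W_llm = rng.standard_normal((256, 256))

def frozen_llm(prefix_emb):
    return prefix_emb @ W_llm

audio = rng.standard_normal((10, 64))  # 10 frames of toy audio features
out = frozen_llm(bridge_net(audio_encoder(audio)))
print(out.shape)  # (10, 256)
```

In this arrangement only `W_bridge` would receive gradient updates; keeping the encoder and LLM frozen is what keeps the approach lightweight.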