Compared with the ample research on visual-text pre-training, few works explore audio-text pre-training, mostly due to the lack of sufficient parallel audio-text data. Most existing methods incorporate the visual modality as a pivot for audio-text pre-training, which inevitably introduces data noise. In this paper, we propose to utilize audio captioning to generate text directly from audio, without the aid of the visual modality, so that the potential noise from modality mismatch is eliminated. Furthermore, we propose caption generation under the guidance of AudioSet tags, leading to more accurate captions. With the above two improvements, we curate high-quality, large-scale parallel audio-text data, based on which we perform audio-text pre-training. We comprehensively demonstrate the performance of the pre-trained model on a series of downstream audio-related tasks, including single-modality tasks such as audio classification and tagging, as well as cross-modal tasks such as audio-text retrieval and audio-based text generation. Experimental results indicate that our approach achieves state-of-the-art zero-shot classification performance on most datasets, suggesting the effectiveness of our synthetic data. The audio encoder also serves as an efficient pattern recognition model when fine-tuned on audio-related tasks. Synthetic data and pre-trained models are available online.