Modeling temporal characteristics plays a significant role in the representation learning of audio waveforms. We propose Contrastive Long-form Language-Audio Pretraining (\textbf{CoLLAP}) to significantly extend the perception window for both the input audio (up to 5 minutes) and the language descriptions (exceeding 250 words), while enabling contrastive learning across modalities and temporal dynamics. Leveraging recent Music-LLMs to generate long-form music captions for full-length songs, augmented with musical temporal structures, we collect 51.3K audio-text pairs derived from the large-scale AudioSet training dataset, where the average audio length reaches 288 seconds. We propose a novel contrastive learning architecture that fuses language representations with structured audio representations by segmenting each song into clips and extracting their embeddings. With an attention mechanism, we capture multimodal temporal correlations, allowing the model to automatically weigh and enhance the final fusion score for improved contrastive alignment. Finally, we develop two variants of the CoLLAP model with different types of backbone language models. Through comprehensive experiments on multiple long-form music-text retrieval datasets, we demonstrate consistent improvements in retrieval accuracy over baselines. We also show that the pretrained CoLLAP models can be transferred to various music information retrieval tasks with heterogeneous long-form multimodal contexts.