Contrastively pretrained audio-language models (e.g., CLAP) excel at clip-level understanding but struggle with frame-level tasks. Existing extensions fail to exploit the varying granularity of real-world audio-text data, where massive clip-level textual descriptions coexist with limited frame-level annotations. This paper proposes Fine-grained Language-Audio Pretraining (FineLAP), a novel training paradigm that advances both clip- and frame-level alignment in CLAP with heterogeneous data. FineLAP introduces a dual-stream sigmoid loss with a cluster-based sampling strategy to jointly learn from clip- and frame-level supervision. To capture both global semantics and local details, FineLAP uses a decoupled audio projector on top of a self-supervised encoder. To alleviate the scarcity of temporally annotated data, we present FineLAP-100k, a large-scale synthetic sound event detection (SED) dataset constructed through a scalable curation pipeline. Extensive experiments demonstrate that FineLAP achieves state-of-the-art performance across multiple audio understanding tasks, including retrieval, classification, SED, and text-to-audio grounding. Ablation studies further show that coarse- and fine-grained alignment are mutually beneficial, providing insights for building better audio-language models (ALMs).
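To make the idea of a dual-stream sigmoid loss more concrete, the following is a minimal sketch and not the paper's formulation, which the abstract does not specify. It assumes a SigLIP-style pairwise sigmoid loss for both streams: the clip stream treats `audio[i]` and `text[i]` as the only positive pair in a batch, and the frame stream treats each (frame, event) pair as positive when the event is active in that frame. The function names, the `scale` and `bias` values, and the `frame_weight` mixing term are all hypothetical, and the cluster-based sampling strategy is not modeled here.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) for both signs of x.
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def clip_sigmoid_loss(audio, text, scale=10.0, bias=-10.0):
    """Clip stream (assumed form): audio[i] matches text[i], all other pairs are negatives."""
    n = len(audio)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = scale * dot(audio[i], text[j]) + bias
            total -= log_sigmoid(label * logit)
    return total / n


def frame_sigmoid_loss(frames, events, active, scale=10.0, bias=-10.0):
    """Frame stream (assumed form): active[t][k] is 1 if event k occurs in frame t."""
    total = 0.0
    for t, frame in enumerate(frames):
        for k, event in enumerate(events):
            label = 1.0 if active[t][k] else -1.0
            logit = scale * dot(frame, event) + bias
            total -= log_sigmoid(label * logit)
    return total / (len(frames) * len(events))


def dual_stream_loss(audio, text, frames, events, active, frame_weight=1.0):
    # Hypothetical combination: a weighted sum of the two streams.
    clip_term = clip_sigmoid_loss(audio, text)
    frame_term = frame_sigmoid_loss(frames, events, active)
    return clip_term + frame_weight * frame_term


if __name__ == "__main__":
    # Toy unit-norm embeddings: two clips, two captions, three frames, two events.
    audio = [[1.0, 0.0], [0.0, 1.0]]
    text = [[1.0, 0.0], [0.0, 1.0]]
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    events = [[1.0, 0.0], [0.0, 1.0]]
    active = [[1, 0], [0, 1], [1, 0]]
    print(dual_stream_loss(audio, text, frames, events, active))
```

Because every audio-text pair is scored independently by a sigmoid, clip-level and frame-level batches do not need to share a softmax normalization. This is one plausible reason a sigmoid objective suits heterogeneous supervision, though the abstract itself does not give the motivation.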