Contrastive language-audio pretraining~(CLAP) aligns the representations of audio and language and achieves strong performance in retrieval and classification tasks. However, current CLAP models struggle to capture temporal information within audio and text features, which substantially limits them in tasks such as audio retrieval and generation. To address this gap, we introduce T-CLAP, a temporal-enhanced CLAP model. We use Large Language Models~(LLMs) and mixed-up strategies to generate temporal-contrastive captions for audio clips from extensive audio-text datasets. We then design a new temporal-focused contrastive loss and use it to fine-tune the CLAP model on this synthetic data. We conduct comprehensive experiments and analysis on multiple downstream tasks. T-CLAP shows improved capability in capturing the temporal relationships of sound events and outperforms state-of-the-art models by a significant margin.