Contrastive language-audio pretraining (CLAP) has achieved notable success in learning semantically rich audio representations and is widely adopted across audio-related tasks. However, current CLAP models face several key limitations. First, they are typically trained on relatively small datasets, often comprising only a few million audio samples. Second, existing CLAP models are restricted to short, fixed-duration inputs, which constrains their use in real-world scenarios involving variable-duration audio. Third, the standard contrastive training objective operates on global representations, which may hinder the learning of dense, fine-grained audio features. To address these challenges, we introduce Scalable Language-Audio Pretraining (SLAP), which scales language-audio pretraining to 109 million audio-text pairs with variable audio durations and incorporates multiple training objectives. SLAP unifies the contrastive loss with additional self-supervised and captioning losses in a single training stage, facilitating the learning of richer dense audio representations. The proposed SLAP model achieves new state-of-the-art performance on audio-text retrieval and zero-shot audio classification, demonstrating its effectiveness across diverse benchmarks.
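The abstract does not specify the exact form of each objective. As an illustrative sketch only (not the paper's implementation), the single-stage multi-objective training can be viewed as a weighted sum of a symmetric InfoNCE contrastive loss over paired audio/text embeddings plus auxiliary self-supervised and captioning terms; the temperature and loss weights below are assumed placeholder values.

```python
import numpy as np

def info_nce(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss; matched pairs share a row index.

    Illustrative sketch: temperature value is an assumption, not from the paper.
    """
    # L2-normalize each embedding so similarities are cosine similarities
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature  # pairwise similarity matrix

    def xent(l):
        # cross-entropy with the matched (diagonal) pair as the target class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average of audio-to-text and text-to-audio directions
    return 0.5 * (xent(logits) + xent(logits.T))

def slap_style_loss(contrastive, ssl, captioning, w_ssl=1.0, w_cap=1.0):
    """Single-stage combination of objectives as a weighted sum.

    The weighting scheme is a hypothetical stand-in for however SLAP
    balances its contrastive, self-supervised, and captioning losses.
    """
    return contrastive + w_ssl * ssl + w_cap * captioning
```

With perfectly matched embeddings the contrastive term collapses toward zero, while mismatched pairs push it toward log(batch size), which is what drives paired audio and text representations together.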