Contrastive Language-Audio Pretraining (CLAP) is pre-trained to associate audio features with human language, which makes it a natural zero-shot classifier for unseen sound categories. To adapt CLAP to downstream tasks, prior work requires labeled in-domain audio, which limits scalability when data is scarce and sacrifices the original CLAP's ability to detect novel classes. In this work, we leverage the modality alignment in CLAP to propose an efficient audio-free prompt tuning scheme that optimizes a few prompt tokens from text instead of audio, which also regularizes the model space to avoid overfitting to the seen classes. Building on this scheme, we further explore a multi-grained prompt design that fuses global and local information. Experiments on several tasks demonstrate that our approach improves on CLAP and outperforms other training methods in both model performance and training efficiency. In zero-shot inference on unseen categories, it still shows better transferability than vanilla CLAP. Moreover, our method remains applicable even when only the downstream class names are known. The code will be released soon.
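As an illustration of the zero-shot setting described above, the following minimal sketch scores one query embedding against the text embeddings of candidate class names and picks the most similar class. It is not the released code: the function names `cosine` and `zero_shot_probs`, the 3-dimensional toy embeddings, and the temperature of 0.07 are all assumptions made for the example, and in practice the vectors would come from CLAP's audio and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_probs(query_emb, class_text_embs, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities.

    `temperature` is an assumed value chosen for illustration only.
    """
    names = list(class_text_embs)
    logits = [cosine(query_emb, class_text_embs[n]) / temperature for n in names]
    peak = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return {n: e / total for n, e in zip(names, exps)}


# Toy embeddings (hypothetical values, not real CLAP outputs).
class_text_embs = {
    "dog bark": [0.9, 0.1, 0.0],
    "siren": [0.0, 0.8, 0.6],
    "rain": [0.1, 0.2, 0.9],
}
audio_emb = [0.8, 0.2, 0.1]  # stand-in for an encoded audio clip

probs = zero_shot_probs(audio_emb, class_text_embs)
prediction = max(probs, key=probs.get)
print(prediction)
```

Because audio and text share one embedding space, the same scoring function accepts a text embedding as the query, which is the property an audio-free tuning scheme relies on when it trains from text alone.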