Audio-Language Models (ALMs) have shown remarkable success in zero-shot audio classification by aligning audio waveforms with text. Recent efforts to improve downstream performance have centered on learning optimal text prompts. However, previous approaches focus only on the text encoder, leaving the potential of learnable prompts within the audio encoder unexplored. In this paper, we propose a novel framework that introduces trainable prompts into the audio encoder to capture task-specific acoustic features. We demonstrate that integrating audio-side prompt learning with existing text-side approaches enhances few-shot adaptation. Extensive experiments across 11 datasets show that integrating our method as a plug-and-play module alongside existing text prompt tuning generally leads to performance improvements. These findings suggest that explicitly modulating the audio representation space effectively complements text-only prompting approaches. The code is available at https://github.com/hyebin-c/aspl.