Inspired by advances in Vision-Language Models (VLMs), Audio-Language Models (ALMs) have recently achieved remarkable success in zero-shot audio recognition, matching features of audio waveforms with class-specific text prompt features. Because zero-shot performance is sensitive to the choice of hand-crafted text prompts, many prompt learning techniques have been developed for VLMs. We explore the efficacy of these approaches in ALMs and propose a novel method, Prompt Learning in Audio Language Models (PALM), which optimizes the feature space of the text encoder branch. Unlike existing methods that operate in the input space, our approach yields greater training efficiency. We demonstrate the effectiveness of our approach on 11 audio recognition datasets encompassing a variety of speech-processing tasks, and compare the results with three baselines in a few-shot learning setup. Our method is on par with or outperforms the other approaches while being computationally less demanding. Code is available at https://asif-hanif.github.io/palm/
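The zero-shot matching described above, and the contrast between input-space prompt tuning and feature-space optimization, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random-projection "encoders", dimensions, and class counts are all stand-ins for a real ALM's audio and text encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # illustrative embedding dimension

# Stand-in encoders: fixed random projections in place of a real
# ALM's audio and text encoders (assumption, for illustration only).
W_audio = rng.standard_normal((100, DIM))
W_text = rng.standard_normal((8, DIM))

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def encode_audio(wave):   # wave: (100,) toy raw samples
    return l2norm(wave @ W_audio)

def encode_text(tokens):  # tokens: (..., 8) toy token vectors
    return l2norm(tokens @ W_text)

# Class-specific prompts (e.g. "a sound of a dog") become text features.
class_tokens = rng.standard_normal((3, 8))  # 3 hypothetical classes
text_feats = encode_text(class_tokens)      # (3, DIM), unit-norm rows

# Zero-shot prediction: cosine similarity between the audio feature
# and each class's text prompt feature (dot product of unit vectors).
wave = rng.standard_normal(100)
audio_feat = encode_audio(wave)             # (DIM,)
logits = text_feats @ audio_feat            # (3,) cosine similarities
pred = int(np.argmax(logits))

# PALM-style idea (sketch): rather than learning prompt tokens in the
# input space, learn a small offset directly in the text feature space,
# which skips repeated text-encoder forward passes during training.
offset = 0.01 * rng.standard_normal((3, DIM))  # learnable in practice
tuned_feats = l2norm(text_feats + offset)
tuned_logits = tuned_feats @ audio_feat
```

In a real training loop only `offset` would receive gradients, so the text encoder can be run once per class and then frozen out of the optimization.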