Prompt tuning has achieved remarkable progress in vision-language models (VLMs) and has recently been adopted for audio-language models (ALMs). However, its generalization ability in ALMs remains largely underexplored. We observe that conventional prompt tuning for ALMs also suffers from the Base-New Tradeoff, in which gains on base (seen) classes come at the expense of new (unseen) classes, and we trace this issue to a disrupted semantic structure in the embedding space. To address it, we propose Semantically Expanded Prompt Tuning (SEPT), a plug-and-play framework that explicitly regularizes the prompt embedding space by incorporating semantic neighbors generated by large language models. SEPT introduces a novel semantic expansion loss with margin constraints that promote intra-class compactness and inter-class separability, thereby enhancing the semantic structure of the prompt embedding space. For comprehensive evaluation, we establish the first benchmark setup for prompt generalization in ALMs, covering both base-to-new generalization and cross-dataset transferability. Extensive experiments demonstrate that SEPT consistently improves generalization performance across multiple prompt tuning baselines while leaving inference-time computational cost unchanged.
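The abstract does not give the exact form of the semantic expansion loss, so the following is only a minimal sketch of one way a margin-constrained objective of this kind could look, not the authors' formulation. It assumes a hinge on cosine similarity: each class prompt embedding is pulled toward its own LLM-generated neighbors (intra-class compactness) and pushed away from the neighbors of other classes (inter-class separability). The function names and the margins `m_pos` and `m_neg` are hypothetical, chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_margin_loss(prompts, neighbors, m_pos=0.8, m_neg=0.3):
    """Illustrative hinge loss over prompt and neighbor embeddings.

    prompts[c]   : embedding of the learned prompt for class c
    neighbors[c] : list of embeddings of semantic neighbors of class c
    m_pos, m_neg : hypothetical margins (assumptions, not from the paper)
    """
    intra, inter = [], []
    for c, prompt in enumerate(prompts):
        for j, group in enumerate(neighbors):
            for neighbor in group:
                sim = cosine(prompt, neighbor)
                if j == c:
                    # Penalize own-class neighbors that are less similar than m_pos.
                    intra.append(max(0.0, m_pos - sim))
                else:
                    # Penalize other-class neighbors that are more similar than m_neg.
                    inter.append(max(0.0, sim - m_neg))
    intra_term = sum(intra) / len(intra)
    inter_term = sum(inter) / len(inter) if inter else 0.0
    return intra_term + inter_term


# Toy 2-D example with two classes and one neighbor each.
neighbors = [[[0.9, 0.1]], [[0.1, 0.9]]]
well_structured = semantic_margin_loss([[1.0, 0.0], [0.0, 1.0]], neighbors)
# Second prompt has drifted toward the first class, mimicking a disrupted space.
collapsed = semantic_margin_loss([[1.0, 0.0], [1.0, 0.1]], neighbors)
```

In this toy setup the well-separated prompts satisfy both margins, while the drifted prompt violates them and incurs a positive penalty. Because a term like this would only be used as a training-time regularizer, it is consistent with the abstract's claim that inference cost is unaffected.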