Visual and textual soft prompt tuning effectively adapts Vision-Language Models (VLMs) to downstream tasks. However, fine-tuning on video tasks impairs the model's ability to generalize to unseen classes. Existing methods try to mitigate this forgetting by regularizing the gap between hand-crafted prompts and soft prompts, but this also weakens the learning capacity of the soft prompts. To address this trade-off, we propose a plug-and-play coupling prompt learning framework that improves the generalization of VLMs on video tasks. The core motivation is to mitigate the narrowing of the semantic space during fine-tuning by introducing an externally supervised prompt. Specifically, for textual prompts, we introduce prompts pre-trained on other datasets as hard prompt tokens, concatenate them with the soft prompt tokens, and couple the two through a learnable mapping layer. This competitive prompting approach prevents the semantic space from overfitting to the supervised categories. In addition, we introduce a set of carefully designed irrelevant videos and negative prompts as generic attribute anchors, which maintain the generic relevance of attributes in the pre-trained semantic space and thus preserve generalization ability. Experiments on video tasks show that our method significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks, particularly on base-to-new class prediction.
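The textual coupling step described above (hard prompt tokens concatenated with soft prompt tokens, then passed through a learnable mapping layer) might look roughly like the following. This is only a minimal, dependency-free sketch under several assumptions not stated in the abstract: the mapping layer is taken to be a single linear map applied to every token of the concatenated prompt, it is initialized to the identity so that coupling starts as a no-op, and the names `linear`, `couple_prompts`, and `identity`, as well as the token counts and embedding size, are purely illustrative. A real implementation would presumably use a tensor library and train the soft tokens and the mapping by backpropagation.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(tokens: Matrix, weight: Matrix, bias: Vector) -> Matrix:
    """Apply y = x @ weight + bias to every token (row) in `tokens`."""
    columns = list(zip(*weight))  # columns of the (d_in x d_out) weight matrix
    return [
        [sum(x * w for x, w in zip(token, column)) + b
         for column, b in zip(columns, bias)]
        for token in tokens
    ]


def couple_prompts(hard_tokens: Matrix, soft_tokens: Matrix,
                   weight: Matrix, bias: Vector) -> Matrix:
    """Concatenate hard and soft prompt tokens, then apply the mapping layer.

    Assumption: hard tokens come from prompts pre-trained on another dataset
    and stay frozen; soft tokens, `weight`, and `bias` would be learnable.
    """
    prompt = hard_tokens + soft_tokens  # concatenation along the token axis
    return linear(prompt, weight, bias)


def identity(dim: int) -> Matrix:
    """Identity matrix, used here as a hypothetical no-op initialization."""
    return [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]


# Toy sizes (illustrative only): 2 hard tokens, 3 soft tokens, embedding size 4.
dim = 4
hard_tokens = [[0.1 * (i + j) for j in range(dim)] for i in range(2)]
soft_tokens = [[0.0] * dim for _ in range(3)]
weight = identity(dim)
bias = [0.0] * dim

coupled = couple_prompts(hard_tokens, soft_tokens, weight, bias)
```

With the identity initialization, `coupled` has one row per hard or soft token and initially matches the plain concatenation; any departure from that during training would come from updates to the soft tokens and the mapping parameters.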