Soft prompt learning has recently emerged as one of the methods of choice for adapting vision-and-language (V&L) models to a downstream task using a few training examples. However, current methods significantly overfit the training data and suffer large accuracy degradation when tested on unseen classes from the same domain. To address this, we make the following four contributions: (1) To alleviate base class overfitting, we propose a novel Language-Aware Soft Prompting (LASP) learning method based on a text-to-text cross-entropy loss that maximizes the probability that the learned prompts are correctly classified with respect to pre-defined hand-crafted textual prompts. (2) To increase the representation capacity of the prompts, we propose grouped LASP, where each group of prompts is optimized with respect to a separate subset of textual prompts. (3) We identify a visual-language misalignment introduced by prompt learning and LASP and, more importantly, propose a re-calibration mechanism to address it. (4) We show that LASP is inherently amenable to including virtual classes during training, i.e., class names for which no visual samples are available, which further increases the robustness of the learned prompts. Through evaluations on 11 datasets, we show that our approach (a) significantly outperforms all prior works on soft prompting, and (b) matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets. Code will be made available at https://www.adrianbulat.com/lasp.
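To make contribution (1) more concrete, the following is a minimal sketch of one plausible reading of a text-to-text cross-entropy loss. It is not the authors' implementation. It assumes that text features have already been extracted by a text encoder, that each hand-crafted prompt feature is classified among the per-class learned prompt features by cosine similarity, and that a temperature of 0.07 scales the logits. The function name `text_to_text_ce`, its arguments, and the toy feature vectors are all hypothetical.

```python
import math


def l2_normalize(v):
    # Assumes v is a non-zero vector.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


def text_to_text_ce(learned, handcrafted, temperature=0.07):
    """Mean cross-entropy of classifying hand-crafted prompt features
    among the learned prompt features (hypothetical formulation).

    learned:     list of C feature vectors, one per class, from the learned prompts.
    handcrafted: list of templates; each template is a list of C feature
                 vectors, one per class, from a hand-crafted prompt.
    """
    total, count = 0.0, 0
    for template in handcrafted:
        for label, h in enumerate(template):
            # Logits over classes: scaled cosine similarity to each learned feature.
            logits = [cosine(h, r) / temperature for r in learned]
            # Numerically stable log-sum-exp.
            m = max(logits)
            log_z = m + math.log(sum(math.exp(z - m) for z in logits))
            # Negative log-probability of the correct class.
            total += log_z - logits[label]
            count += 1
    return total / count


# Toy features: two classes, one hand-crafted template (all values made up).
handcrafted = [[[0.9, 0.1], [0.1, 0.9]]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # learned features agree with the template
swapped = [[0.0, 1.0], [1.0, 0.0]]   # learned features point at the wrong class

loss_aligned = text_to_text_ce(aligned, handcrafted)
loss_swapped = text_to_text_ce(swapped, handcrafted)
```

In this toy setup the loss is lower when the learned features stay close to the hand-crafted features of the same class, which is the regularizing effect that the abstract attributes to the text-to-text loss. In actual training this term would be combined with the usual image-to-text loss, and the learned features would depend on trainable prompt parameters; both are omitted here.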