Pretrained vision-language models (VLMs) such as CLIP have shown impressive generalization capability in downstream vision tasks when given appropriate text prompts. Instead of designing prompts manually, Context Optimization (CoOp) was recently proposed to learn continuous prompts from task-specific training data. Despite the performance improvements on downstream tasks, several studies have reported that CoOp overfits in two respects: (i) the test accuracy on base classes first improves and then worsens during training; (ii) the test accuracy on novel classes keeps decreasing. However, no existing study has explained or mitigated this overfitting. In this study, we first explore the cause of overfitting by analyzing the gradient flow. Comparative experiments reveal that CoOp favors generalizable features in the early training stage and spurious features in the later stage, which leads to the non-overfitting and overfitting phenomena, respectively. Given these observations, we propose Subspace Prompt Tuning (SubPT), which throughout the entire training process projects the back-propagated gradients onto the low-rank subspace spanned by the eigenvectors of the early-stage gradient flow, and successfully eliminates the overfitting problem. In addition, we equip CoOp with a Novel Feature Learner (NFL) to enhance the generalization of the learned prompts to novel categories beyond the training set, without requiring any image training data. Extensive experiments on 11 classification datasets demonstrate that SubPT+NFL consistently boosts the performance of CoOp and outperforms the state-of-the-art CoCoOp approach. Experiments on more challenging downstream vision tasks, including open-vocabulary object detection and zero-shot semantic segmentation, also verify the effectiveness of the proposed method. Code is available at https://tinyurl.com/mpe64f89.
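The gradient projection at the core of SubPT can be illustrated with a minimal NumPy sketch. The abstract does not specify how the early-stage eigenvectors are obtained, so the sketch assumes they are the top right singular vectors of a matrix of stacked early-stage gradients (equivalently, the leading eigenvectors of its Gram matrix). The function names `early_subspace` and `project_gradient`, the dimensions, and the random stand-in gradients are all hypothetical and are not taken from the released code.

```python
import numpy as np


def early_subspace(early_grads: np.ndarray, rank: int) -> np.ndarray:
    """Return an orthonormal basis (d x rank) for the dominant directions
    of the early-stage gradients, given as rows of a (T x d) matrix.

    Assumption: the subspace is taken from an SVD of the stacked gradients.
    """
    # Right singular vectors of G are the eigenvectors of G^T G.
    _, _, vt = np.linalg.svd(early_grads, full_matrices=False)
    return vt[:rank].T


def project_gradient(grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project a flattened gradient onto the span of the basis columns."""
    return basis @ (basis.T @ grad)


# Toy usage with random stand-ins for prompt gradients (illustrative only).
rng = np.random.default_rng(0)
d, T, rank = 16, 8, 3                    # hypothetical sizes
early_grads = rng.normal(size=(T, d))    # gradients from early iterations
basis = early_subspace(early_grads, rank)

grad = rng.normal(size=d)                # a later-stage gradient
projected = project_gradient(grad, basis)
```

In a training loop, the projected gradient would replace the raw gradient of the prompt parameters before the optimizer step, so that updates stay inside the subspace estimated from the early stage.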