Vision-language models (VLMs) like CLIP have been adapted for Multi-Label Recognition (MLR) with partial annotations by leveraging prompt learning, where positive and negative prompts are learned for each class to associate their embeddings with class presence or absence in the shared vision-text feature space. While this approach improves MLR performance by relying on VLM priors, we hypothesize that learning negative prompts may be suboptimal, as the datasets used to train VLMs lack image-caption pairs that explicitly focus on class absence. To analyze the impact of positive and negative prompt learning on MLR, we introduce PositiveCoOp and NegativeCoOp, where only one prompt is learned with VLM guidance while the other is replaced by an embedding vector learned directly in the shared feature space without relying on the text encoder. Through empirical analysis, we observe that negative prompts degrade MLR performance, and that learning only positive prompts, combined with learned negative embeddings (PositiveCoOp), outperforms dual prompt learning approaches. Moreover, we quantify the performance benefits that prompt learning offers over a simple vision-features-only baseline, observing that the baseline displays strong performance comparable to the dual prompt learning approach (DualCoOp) when the proportion of missing labels is low, while requiring half the training compute and 16 times fewer parameters.
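The per-class scoring idea described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the frozen text encoder applied to learned positive prompts is stubbed with fixed random vectors, and all names, shapes, and the temperature value are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 512, 80  # shared feature dim, number of classes (illustrative)

def l2norm(x, axis=-1):
    """Project vectors onto the unit sphere, as in CLIP's shared space."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# PositiveCoOp-style setup (sketched): positive class embeddings come from
# the frozen text encoder applied to learned prompts (stubbed here), while
# negative embeddings are free vectors learned directly in feature space.
pos_text_emb = l2norm(rng.standard_normal((C, D)))  # t_c+, via text encoder
neg_emb = l2norm(rng.standard_normal((C, D)))       # n_c, learned directly

image_feat = l2norm(rng.standard_normal(D))         # pooled vision feature

tau = 0.07  # CLIP-style temperature (assumed value)
pos_sim = pos_text_emb @ image_feat / tau           # (C,) positive scores
neg_sim = neg_emb @ image_feat / tau                # (C,) negative scores

# Per-class presence probability: softmax over the (positive, negative)
# pair, so each class is an independent binary decision.
p_present = np.exp(pos_sim) / (np.exp(pos_sim) + np.exp(neg_sim))
```

During training, gradients would flow into the prompt tokens (through the text encoder) for the positive side and directly into `neg_emb` for the negative side; swapping which side uses the text encoder gives the NegativeCoOp variant.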