Vision-language models (VLMs) can learn high-quality representations from large-scale training datasets of image-text pairs. Prompt learning is a popular approach to fine-tuning VLMs to adapt them to downstream tasks. Despite its satisfying performance, a major limitation of prompt learning is its demand for labelled data. In real-world scenarios, we may only obtain candidate labels (among which the true label is included) rather than the true labels, due to data privacy or sensitivity issues. In this paper, we provide the first study on prompt learning with candidate labels for VLMs. We empirically demonstrate that prompt learning is more advantageous than other fine-tuning methods for handling candidate labels. Nonetheless, its performance drops as the label ambiguity increases. To improve its robustness, we propose a simple yet effective framework that better leverages the prior knowledge of VLMs to guide the learning process with candidate labels. Specifically, our framework disambiguates candidate labels by aligning the model output with the mixed class posterior jointly predicted by both the learnable and the handcrafted prompts. Besides, our framework can be equipped with various off-the-shelf training objectives for learning with candidate labels to further improve their performance. Extensive experiments demonstrate the effectiveness of our proposed framework.
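To make the alignment step concrete, the sketch below shows one plausible instantiation in PyTorch: the predictions of the learnable and handcrafted prompts are mixed into a class posterior, restricted to each sample's candidate set, and used as a soft target for the learnable prompt. The function name, the fixed mixing weight, and the stop-gradient on the target are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def candidate_disambiguation_loss(learnable_logits: torch.Tensor,
                                  handcrafted_logits: torch.Tensor,
                                  candidate_mask: torch.Tensor,
                                  mix_weight: float = 0.5) -> torch.Tensor:
    """Align the learnable-prompt output with a mixed class posterior
    restricted to each sample's candidate label set.

    learnable_logits / handcrafted_logits: (batch, num_classes)
    candidate_mask: (batch, num_classes), 1 where a class is a candidate
    """
    # Class posteriors predicted by the two prompts.
    p_learn = F.softmax(learnable_logits, dim=-1)
    p_hand = F.softmax(handcrafted_logits, dim=-1)

    # Mixed posterior: a convex combination of both predictions.
    p_mix = mix_weight * p_learn + (1.0 - mix_weight) * p_hand

    # Zero out non-candidate classes and renormalize, so the target
    # distribution places all of its mass inside the candidate set.
    p_target = p_mix * candidate_mask
    p_target = p_target / p_target.sum(dim=-1, keepdim=True).clamp_min(1e-12)
    p_target = p_target.detach()  # assumption: treat the mixed posterior as a fixed target

    # Cross-entropy between the soft target and the learnable-prompt output.
    log_p_learn = F.log_softmax(learnable_logits, dim=-1)
    return -(p_target * log_p_learn).sum(dim=-1).mean()
```

Under this reading, the handcrafted prompt injects the VLM's zero-shot prior into the target, which regularizes the learnable prompt against overfitting to incorrect candidates as ambiguity grows.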