Large pre-trained Vision-Language Models (VLMs), such as CLIP, generalize well to downstream tasks but struggle in few-shot scenarios. Existing prompting techniques focus primarily on global text and image representations and overlook multi-modal attribute characteristics. This limitation hinders the model's ability to perceive fine-grained visual details and restricts its generalization to a broader range of unseen classes. To address this issue, we propose a Multi-modal Attribute Prompting method (MAP) that jointly explores textual attribute prompting, visual attribute prompting, and attribute-level alignment. The proposed MAP has two main merits. First, we introduce learnable visual attribute prompts, enhanced by textual attribute semantics, that adaptively capture visual attributes for images from unknown categories, boosting CLIP's fine-grained visual perception. Second, the proposed attribute-level alignment complements the global alignment to make cross-modal alignment more robust for open-vocabulary objects. To our knowledge, this is the first work to establish cross-modal attribute-level alignment for CLIP-based few-shot adaptation. Extensive experimental results on 11 datasets demonstrate that our method performs favorably against state-of-the-art approaches.
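To make concrete the idea that attribute-level alignment complements global alignment, the following is a minimal toy sketch, not the MAP implementation. The abstract does not specify how attribute features are matched, so everything here is an assumption for illustration: the function names (`attribute_score`, `combined_score`, `classify`), the matching rule (each textual attribute is paired with its most similar visual attribute and the similarities are averaged), the `weight` parameter, and the hand-written 2-D vectors standing in for CLIP features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def attribute_score(visual_attrs, text_attrs):
    """Assumed matching rule: for each textual attribute, take its best-matching
    visual attribute, then average over the textual attributes."""
    best = [max(cosine(v, t) for v in visual_attrs) for t in text_attrs]
    return sum(best) / len(best)


def combined_score(global_img, global_txt, visual_attrs, text_attrs, weight=0.5):
    """Global image-text similarity plus a weighted attribute-level term.
    The additive form and `weight` are illustrative assumptions."""
    return cosine(global_img, global_txt) + weight * attribute_score(visual_attrs, text_attrs)


def classify(global_img, visual_attrs, classes, weight=0.5):
    """Score every class and return the best label together with all scores."""
    scores = {
        name: combined_score(global_img, c["global"], visual_attrs, c["attributes"], weight)
        for name, c in classes.items()
    }
    return max(scores, key=scores.get), scores


# Toy data: both classes share the same global text feature, so the global
# score alone cannot separate them; only the attribute features differ.
classes = {
    "sparrow": {"global": [1.0, 0.0], "attributes": [[1.0, 0.0], [0.0, 1.0]]},
    "finch": {"global": [1.0, 0.0], "attributes": [[-1.0, 0.0], [0.0, -1.0]]},
}
global_img = [1.0, 0.0]
visual_attrs = [[1.0, 0.0], [0.0, 1.0]]

label, scores = classify(global_img, visual_attrs, classes)
print(label, scores)
```

In this toy setup the two classes tie on the global term, and the attribute term alone decides the prediction, which is the kind of fine-grained cue the abstract says global-only prompting overlooks.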