CLIP, as a foundational vision-language model, is widely used for zero-shot image classification because it can relate a broad range of visual concepts to natural language descriptions. However, how to fully leverage CLIP's human-like understanding capabilities to improve zero-shot classification remains an open question. This paper draws inspiration from the human visual perception process: a modern neuroscience view suggests that, when classifying an object, humans first infer its class-independent attributes (e.g., background and orientation), which help separate the foreground object from the background, and then make a decision based on this information. Following this view, we observe that providing CLIP with contextual attributes improves zero-shot classification and mitigates reliance on spurious features. We also observe that CLIP itself can reasonably infer these attributes from an image. Based on these observations, we propose PerceptionCLIP, a training-free, two-step zero-shot classification method. Given an image, it first infers contextual attributes (e.g., background) and then performs object classification conditioned on them. Our experiments show that PerceptionCLIP achieves better generalization, group robustness, and interpretability. For example, PerceptionCLIP with ViT-L/14 improves the worst-group accuracy by 16.5% on the Waterbirds dataset and by 3.5% on CelebA.
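The two-step procedure described in the abstract (infer a contextual attribute, then classify conditioned on it) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `two_step_classify`, the temperature value, the hard argmax over attributes, and the toy embeddings standing in for CLIP image and text features are all assumptions made for illustration. In practice, the text embeddings would come from CLIP's text encoder applied to prompts that combine a class name with an attribute description.

```python
import numpy as np

def normalize(v):
    """L2-normalize along the last axis, as is done with CLIP embeddings."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def two_step_classify(image_emb, text_embs, temperature=0.1):
    """Hypothetical two-step zero-shot classifier (illustrative only).

    image_emb: (d,) normalized image embedding.
    text_embs: (num_classes, num_attrs, d) normalized embeddings, one per
               (class, attribute) prompt.
    temperature: assumed scaling factor, not taken from the paper.
    """
    # Similarity of the image to every (class, attribute) prompt: shape (C, A).
    logits = (text_embs @ image_emb) / temperature

    # Joint distribution over (class, attribute) via a numerically stable softmax.
    joint = np.exp(logits - logits.max())
    joint /= joint.sum()

    # Step 1: infer the contextual attribute by marginalizing over classes.
    attr_probs = joint.sum(axis=0)
    z_hat = int(np.argmax(attr_probs))

    # Step 2: classify conditioned on the inferred attribute.
    class_probs = joint[:, z_hat] / attr_probs[z_hat]
    return z_hat, attr_probs, class_probs

# Toy stand-ins for CLIP features: 2 classes (landbird, waterbird) and
# 2 attributes (on land, on water), each given its own axis in a 4-d space.
class_dirs = np.eye(4)[:2]
attr_dirs = np.eye(4)[2:]
text_embs = normalize(class_dirs[:, None, :] + attr_dirs[None, :, :])  # (2, 2, 4)

# An image of a waterbird (class 1) on a land background (attribute 0).
image_emb = normalize(class_dirs[1] + attr_dirs[0])

z_hat, attr_probs, class_probs = two_step_classify(image_emb, text_embs)
print("inferred attribute:", z_hat)
print("predicted class:", int(np.argmax(class_probs)))
```

In this toy setup the background is inferred first and the class decision is then read from the prompts that mention that background only. Replacing the hard argmax with a weighted sum over attributes is another option the sketch does not cover.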