Vision-language models like CLIP are widely used in zero-shot image classification because they can relate a wide range of visual concepts to natural language descriptions. However, how to fully leverage CLIP's unprecedented human-like understanding capabilities to achieve better performance remains an open question. This paper draws inspiration from the human visual perception process: when classifying an object, humans first infer contextual attributes (e.g., background and orientation), which help separate the foreground object from the background, and then classify the object based on this information. Following this idea, we observe that providing CLIP with contextual attributes improves zero-shot image classification and mitigates reliance on spurious features. We also observe that CLIP itself can reasonably infer these attributes from an image. Based on these observations, we propose PerceptionCLIP, a training-free, two-step zero-shot classification method. Given an image, it first infers contextual attributes (e.g., background) and then performs object classification conditioned on them. Our experiments show that PerceptionCLIP achieves better generalization, group robustness, and interpretability. For example, PerceptionCLIP with ViT-L/14 improves the worst-group accuracy by 16.5% on the Waterbirds dataset and by 3.5% on CelebA.
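The two-step procedure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `score` callable is a hypothetical stand-in for a CLIP image-text similarity, the prompt wording in the docstring, the log-sum-exp aggregation over classes in step 1, and the toy score table are all assumptions made only for illustration.

```python
import math


def softmax(scores):
    """Turn a dict of raw scores into a dict of probabilities."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def two_step_classify(score, classes, attributes):
    """Infer a contextual attribute first, then classify conditioned on it.

    `score(cls, attr)` is a hypothetical callable returning the similarity
    between the image and a prompt such as "a photo of a {cls} {attr}".
    """
    # Step 1: infer the contextual attribute. As an assumption, each
    # attribute is scored by aggregating over all classes (log-sum-exp).
    attr_scores = {
        a: math.log(sum(math.exp(score(c, a)) for c in classes))
        for a in attributes
    }
    attr_probs = softmax(attr_scores)
    best_attr = max(attr_probs, key=attr_probs.get)

    # Step 2: classify the object using only prompts that mention the
    # inferred attribute.
    class_probs = softmax({c: score(c, best_attr) for c in classes})
    best_class = max(class_probs, key=class_probs.get)
    return best_attr, best_class


# Toy similarity table (made-up numbers, not CLIP outputs).
TOY_SCORES = {
    ("waterbird", "on water"): 2.0,
    ("landbird", "on water"): 1.0,
    ("waterbird", "on land"): 0.5,
    ("landbird", "on land"): 0.8,
}

attr, cls = two_step_classify(
    lambda c, a: TOY_SCORES[(c, a)],
    classes=["waterbird", "landbird"],
    attributes=["on water", "on land"],
)
print(attr, cls)
```

In practice, `score` would be replaced by the cosine similarity between a CLIP image embedding and the text embedding of the attribute-augmented prompt; the sketch commits to the single most likely attribute, whereas a softer variant could weight class scores by the attribute probabilities instead.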