Recognizing and disentangling visual attributes from objects is foundational to many computer vision applications. While large vision-language representations such as CLIP have largely resolved the task of zero-shot object recognition, zero-shot visual attribute recognition remains a challenge because CLIP's contrastively learned vision-language representation cannot effectively capture object-attribute dependencies. In this paper, we target this weakness and propose a sentence generation-based retrieval formulation for attribute recognition that is novel in 1) explicitly modeling the object-attribute relation to be measured and retrieved as a conditional probability graph, which converts the recognition problem into a dependency-sensitive language-modeling problem, and 2) applying a large pretrained Vision-Language Model (VLM) to this reformulation and naturally distilling its knowledge of image-object-attribute relations for use in attribute recognition. Specifically, for each attribute to be recognized in an image, we measure the visually conditioned probability of generating a short sentence that encodes the attribute's relation to objects in the image. Unlike contrastive retrieval, which measures likelihood by globally aligning elements of the sentence to the image, generative retrieval is sensitive to the order and dependency of objects and attributes in the sentence. We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets: Visual Attributes in the Wild (VAW) and our newly proposed Visual Genome Attribute Ranking (VGARank).
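The generative retrieval idea described above can be illustrated with a minimal sketch: each candidate attribute is placed into a short sentence, and the sentence is scored by summing the log-probabilities of its tokens, each conditioned on the image and on the preceding tokens. Everything named here is an assumption made for illustration and is not taken from the paper: the `next_token_probs` interface, the template "the {obj} is {attr}", the probability floor for unseen tokens, and the hand-written lookup table that stands in for a pretrained VLM decoder.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface (an assumption, not the paper's API): given an image
# and the tokens generated so far, return a distribution over the next token.
NextTokenFn = Callable[[object, Sequence[str]], Dict[str, float]]


def sentence_log_prob(image: object, tokens: Sequence[str],
                      next_token_probs: NextTokenFn) -> float:
    """Sum of log p(token_t | image, tokens_<t) over the sentence."""
    total = 0.0
    for t, tok in enumerate(tokens):
        probs = next_token_probs(image, tokens[:t])
        # Assumed floor so that unseen tokens do not produce log(0).
        total += math.log(probs.get(tok, 1e-12))
    return total


def rank_attributes(image: object, obj: str, attributes: Sequence[str],
                    next_token_probs: NextTokenFn,
                    template: str = "the {obj} is {attr}") -> List[Tuple[str, float]]:
    """Rank candidate attributes by the log-probability of their sentence."""
    scored = []
    for attr in attributes:
        tokens = template.format(obj=obj, attr=attr).split()
        scored.append((attr, sentence_log_prob(image, tokens, next_token_probs)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_next_token_probs(image: object, prefix: Sequence[str]) -> Dict[str, float]:
    """Toy stand-in for a VLM decoder; it ignores the image entirely."""
    table = {
        (): {"the": 1.0},
        ("the",): {"car": 0.6, "sky": 0.4},
        ("the", "car"): {"is": 1.0},
        ("the", "sky"): {"is": 1.0},
        # The attribute distribution depends on which object came before it.
        ("the", "car", "is"): {"red": 0.7, "blue": 0.2, "wet": 0.1},
        ("the", "sky", "is"): {"blue": 0.8, "red": 0.1, "wet": 0.1},
    }
    return table.get(tuple(prefix), {})


candidates = ["red", "blue", "wet"]
car_ranking = rank_attributes(None, "car", candidates, toy_next_token_probs)
sky_ranking = rank_attributes(None, "sky", candidates, toy_next_token_probs)
print(car_ranking)
print(sky_ranking)
```

In this toy setup the same attribute receives a different score depending on the object that precedes it in the sentence, which is the dependency sensitivity that a global sentence-image alignment score does not provide by construction. A real system would replace `toy_next_token_probs` with the token distributions of an image-conditioned language model.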