ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling

From arXiv. Accepted at ECCV 2024. Contact: zhuwilliam[at]google[dot]com. GitHub: https://github.com/google-research/google-research/tree/master/attribute_with_prefixlm

Recognizing and disentangling visual attributes from objects is foundational to many computer vision applications. While large vision-language representations like CLIP have largely resolved the task of zero-shot object recognition, zero-shot visual attribute recognition remains a challenge because CLIP's contrastively learned vision-language representation cannot effectively capture object-attribute dependencies. In this paper, we target this weakness and propose a sentence generation-based retrieval formulation for attribute recognition that is novel in 1) explicitly modeling the object-attribute relation to be measured and retrieved as a conditional probability graph, which converts the recognition problem into a dependency-sensitive language-modeling problem, and 2) applying a large pretrained Vision-Language Model (VLM) to this reformulation and naturally distilling its knowledge of image-object-attribute relations for use in attribute recognition. Specifically, for each attribute to be recognized in an image, we measure the visually conditioned probability of generating a short sentence that encodes the attribute's relation to objects in the image. Unlike contrastive retrieval, which measures likelihood by globally aligning elements of the sentence to the image, generative retrieval is sensitive to the order and dependency of objects and attributes in the sentence. We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets: Visual Attribute in the Wild (VAW) and our newly proposed Visual Genome Attribute Ranking (VGARank).
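The generative retrieval idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the image-conditioned prefix language model is replaced by a hypothetical toy table of next-token probabilities (`toy_model`), the sentence template is assumed to be simply "attribute object", and all numbers are made up for illustration. The sketch only shows the scoring principle, which is to rank candidate attributes by the log-probability of generating the corresponding sentence, and that this score depends on token order.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical stand-in for an image-conditioned prefix language model.
# It maps (previous token, next token) to a probability. A real VLM would
# condition on image features and on the full text prefix instead.
ToyModel = Dict[Tuple[str, str], float]

UNSEEN_PROB = 1e-6  # assumed floor for token pairs missing from the toy table


def sentence_log_prob(model: ToyModel, tokens: Sequence[str]) -> float:
    """Sum log p(token_t | token_{t-1}) over the sentence, starting at <s>."""
    total = 0.0
    prev = "<s>"
    for tok in tokens:
        total += math.log(model.get((prev, tok), UNSEEN_PROB))
        prev = tok
    return total


def rank_attributes(
    model: ToyModel, obj: str, attributes: Sequence[str]
) -> List[Tuple[str, float]]:
    """Score the sentence "<attribute> <object>" for each candidate attribute
    and return (attribute, log-probability) pairs, best first."""
    scored = [(attr, sentence_log_prob(model, [attr, obj])) for attr in attributes]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up probabilities, imagined as conditioned on a photo of a red apple.
toy_model: ToyModel = {
    ("<s>", "red"): 0.5,
    ("<s>", "green"): 0.3,
    ("<s>", "wooden"): 0.2,
    ("red", "apple"): 0.6,
    ("green", "apple"): 0.2,
    ("wooden", "apple"): 0.01,
}

ranking = rank_attributes(toy_model, "apple", ["green", "wooden", "red"])
for attr, score in ranking:
    print(f"{attr}: {score:.3f}")
```

Because the score is a product of conditional probabilities along the sentence, swapping the token order (for example, scoring "apple red" instead of "red apple") yields a different value. A score built from an order-insensitive global alignment would not make that distinction, which is the contrast the abstract draws between generative and contrastive retrieval.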
