DesCo: Learning Object Recognition with Rich Language Descriptions

Recent developments in vision-language approaches have instigated a paradigm shift in learning visual recognition models from language supervision. These approaches align objects with language queries (e.g. "a photo of a cat") and improve the models' adaptability to novel objects and domains. Recently, several studies have attempted to query these models with complex language expressions that specify fine-grained semantic details, such as attributes, shapes, textures, and relations. However, simply incorporating language descriptions as queries does not guarantee that the models interpret them accurately. In fact, our experiments show that GLIP, the state-of-the-art vision-language model for object detection, often disregards contextual information in the language descriptions and instead relies heavily on detecting objects solely by their names. To tackle these challenges, we propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions, consisting of two major innovations: 1) we employ a large language model as a commonsense knowledge engine to generate rich language descriptions of objects based on object names and the raw image-text caption; 2) we design context-sensitive queries to improve the model's ability to decipher intricate nuances embedded within descriptions and to force the model to focus on context rather than object names alone. On two novel object detection benchmarks, LVIS and OmniLabel, under the zero-shot detection setting, our approach achieves 34.8 APr minival (+9.1) and 29.3 AP (+3.6), respectively, surpassing the prior state-of-the-art models, GLIP and FIBER, by a large margin.
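The second innovation can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Query` class, the helper names, and the example descriptions are all hypothetical, and the abstract does not specify how the queries are actually built. The sketch only shows the general idea of pairing one matching description with distractors that share the same object name, so that a matcher keyed on the name alone cannot tell them apart.

```python
from dataclasses import dataclass
import random


@dataclass
class Query:
    text: str          # full language query shown to the detector
    is_positive: bool  # True if the description matches the target object


def build_queries(name, positive_desc, negative_descs, seed=0):
    """Mix one matching description with same-name distractors, then shuffle.

    All queries contain the same object name, so only the surrounding
    context distinguishes the positive from the negatives (assumption:
    this is one plausible way to make a query context-sensitive).
    """
    queries = [Query(f"{name}, {positive_desc}", True)]
    queries += [Query(f"{name}, {desc}", False) for desc in negative_descs]
    random.Random(seed).shuffle(queries)
    return queries


def name_only_match(query, name):
    """Baseline that ignores context: fires whenever the name appears."""
    return name in query.text


# Made-up descriptions for illustration only.
queries = build_queries(
    "mallet",
    "a kind of tool, wooden handle with a round head",
    ["a kind of tool, long handle with a sharp blade"],
)

# A name-only matcher accepts every query, including the distractor.
false_positives = sum(
    name_only_match(q, "mallet") and not q.is_positive for q in queries
)
```

Under these assumptions, the name-only baseline accepts the distractor as well as the true description, which is the failure mode the abstract attributes to GLIP. A model trained on such mixed queries would have to read the description to score well.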

