Recent developments in vision-language approaches have instigated a paradigm shift in learning visual recognition models from language supervision. These approaches align objects with language queries (e.g., "a photo of a cat") and improve the models' adaptability to novel objects and domains. Recently, several studies have attempted to query these models with complex language expressions that specify fine-grained semantic details, such as attributes, shapes, textures, and relations. However, simply incorporating language descriptions as queries does not guarantee that the models interpret them accurately. In fact, our experiments show that GLIP, the state-of-the-art vision-language model for object detection, often disregards contextual information in the language descriptions and instead relies heavily on detecting objects by their names alone. To tackle these challenges, we propose a new description-conditioned (DesCo) paradigm for learning object recognition models with rich language descriptions, consisting of two major innovations: 1) we employ a large language model as a commonsense knowledge engine to generate rich language descriptions of objects based on object names and the raw image-text caption; 2) we design context-sensitive queries to improve the model's ability to decipher intricate nuances embedded within descriptions and to force the model to focus on context rather than object names alone. On two novel object detection benchmarks, LVIS and OmniLabel, under the zero-shot detection setting, our approach achieves 34.8 APr on minival (+9.1) and 29.3 AP (+3.6), respectively, surpassing the prior state-of-the-art models, GLIP and FIBER, by a large margin.