Open-Vocabulary Object Detection (OVOD) aims to detect arbitrary object categories. Although numerous large-scale pre-training efforts have produced versatile foundation models whose impressive zero-shot capabilities facilitate OVOD, the possibility of building a universal understanding of arbitrary objects directly from already pre-trained foundation models is usually overlooked. In this paper, we therefore propose a training-free Guess What Vision Language Model (GW-VLM), which forms a universal understanding paradigm for OVOD by coupling our carefully designed Multi-Scale Visual-Language Searching (MS-VLS) with a Contextual Concept Prompt (CCP). The approach engages a pre-trained Vision Language Model (VLM) and a Large Language Model (LLM) in a game of "guess what": MS-VLS applies multi-scale visual-language soft alignment so that the VLM generates semantic snippets from class-agnostic object detection results, while CCP forms a concept flow from the MS-VLS output and guides the LLM to interpret these snippets for OVOD. Extensive experiments on natural-image and remote-sensing datasets, including COCO val, Pascal VOC, DIOR, and NWPU-10, show that GW-VLM achieves superior OVOD performance compared with state-of-the-art methods without any training step.
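For illustration only, the following is a minimal Python sketch of the "guess what" pipeline summarized above, under stated assumptions: `propose_boxes` (a class-agnostic detector), `clip_model` (a CLIP-style VLM with `encode_image` / `encode_text`), and `llm_complete` (an LLM text endpoint) are hypothetical stand-ins, not the paper's actual implementation. The sketch only mirrors the described flow: class-agnostic proposals, multi-scale visual-language soft alignment, snippet construction, and an LLM guess.

```python
# Hedged sketch of the training-free "guess what" flow; all callables are
# hypothetical stand-ins injected by the caller, not the paper's code.
import torch
import torch.nn.functional as F

def expand_and_crop(image, box, scale):
    # Enlarge the box by `scale` about its center, clamp to the image (PIL-style), and crop.
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    w, h = (x1 - x0) * scale / 2, (y1 - y0) * scale / 2
    return image.crop((max(0, cx - w), max(0, cy - h),
                       min(image.width, cx + w), min(image.height, cy + h)))

def guess_what(image, vocabulary, propose_boxes, clip_model, llm_complete,
               scales=(1.0, 1.5, 2.0), top_k=5):
    # Assumed: encode_image -> (1, d) embedding; encode_text(list[str]) -> (N, d).
    text_emb = F.normalize(clip_model.encode_text(vocabulary), dim=-1)
    guesses = []
    for box in propose_boxes(image):                       # class-agnostic proposals
        per_scale = []
        for s in scales:                                   # multi-scale crops around each box
            crop = expand_and_crop(image, box, s)
            img_emb = F.normalize(clip_model.encode_image(crop), dim=-1)
            sims = img_emb @ text_emb.T                    # (1, N) cosine similarities
            per_scale.append(torch.softmax(sims, dim=-1))  # soft alignment per scale
        scores = torch.stack(per_scale).mean(0).squeeze(0) # fuse scales -> (N,)
        snippet = [vocabulary[i] for i in scores.topk(top_k).indices]
        # Contextual prompt: ask the LLM to name the object from the snippet words.
        prompt = ("An object matches these words, most likely first: "
                  f"{', '.join(snippet)}. Which single category is it?")
        guesses.append((box, llm_complete(prompt)))
    return guesses
```

Averaging the per-scale soft-alignment scores is one plausible reading of "multi-scale soft alignment"; the paper's actual fusion rule and prompting template may differ.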