Recent Multimodal Large Language Models (MLLMs) perform remarkably well on vision-language tasks such as image captioning and question answering, but they lack an essential perception ability: object detection. In this work, we address this limitation by introducing a novel research problem, contextual object detection, which means understanding visible objects within different human-AI interactive contexts. We investigate three representative scenarios: the language cloze test, visual captioning, and question answering. Moreover, we present ContextDET, a unified multimodal model capable of end-to-end differentiable modeling of visual-language contexts, so as to locate, identify, and associate visual objects with language inputs for human-AI interaction. ContextDET comprises three key submodels: (i) a visual encoder for extracting visual representations, (ii) a pre-trained LLM for multimodal context decoding, and (iii) a visual decoder for predicting bounding boxes given contextual object words. The new generate-then-detect framework enables us to detect object words within human vocabulary. Extensive experiments show the advantages of ContextDET on our proposed CODE benchmark, open-vocabulary detection, and referring image segmentation. GitHub: https://github.com/yuhangzang/ContextDET.
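The data flow through the three submodels can be illustrated with a minimal sketch. It is not the ContextDET implementation: `generate_then_detect`, `Detection`, the box format, and the three `toy_*` stand-ins are hypothetical names introduced here, and the real submodels are learned networks trained end to end rather than plain Python callables. The sketch only shows the ordering implied by generate-then-detect: object words are produced from the multimodal context first, and boxes are then predicted for each word.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical box format for illustration: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


@dataclass
class Detection:
    word: str  # contextual object word produced by the language model
    box: Box   # bounding box predicted for that word


def generate_then_detect(
    image: Sequence[float],
    prompt: str,
    visual_encoder: Callable[[Sequence[float]], List[float]],
    llm_decoder: Callable[[List[float], str], List[str]],
    visual_decoder: Callable[[List[float], str], List[Box]],
) -> List[Detection]:
    """Chain the three submodels: encode, generate words, then detect."""
    # (i) Extract visual representations from the image.
    visual_tokens = visual_encoder(image)
    # (ii) Decode the multimodal context into object words.
    object_words = llm_decoder(visual_tokens, prompt)
    # (iii) Predict bounding boxes for each generated object word.
    detections: List[Detection] = []
    for word in object_words:
        for box in visual_decoder(visual_tokens, word):
            detections.append(Detection(word=word, box=box))
    return detections


# Toy stand-ins (assumptions, not real models) so the sketch runs.
def toy_encoder(image: Sequence[float]) -> List[float]:
    return list(image)


def toy_llm(visual_tokens: List[float], prompt: str) -> List[str]:
    # Pretend the model fills the cloze blanks with these words.
    return ["dog", "frisbee"]


def toy_box_decoder(visual_tokens: List[float], word: str) -> List[Box]:
    # One dummy box per word; its width depends on the word length.
    return [(0.0, 0.0, float(len(word)), 1.0)]


if __name__ == "__main__":
    results = generate_then_detect(
        [0.1, 0.2],
        "A [MASK] catches a [MASK].",
        toy_encoder,
        toy_llm,
        toy_box_decoder,
    )
    for det in results:
        print(det.word, det.box)
```

Because the words come from the language model rather than from a fixed label set, the set of detectable categories in this arrangement is bounded by what the language model can generate, which is the intuition behind detecting object words within human vocabulary.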