We present DetToolChain, a novel prompting paradigm that unleashes the zero-shot object detection ability of multimodal large language models (MLLMs) such as GPT-4V and Gemini. Our approach consists of a detection prompting toolkit inspired by high-precision detection priors and a new chain-of-thought to implement these prompts. Specifically, the prompts in the toolkit are designed to guide the MLLM to focus on regional information (e.g., zooming in), read coordinates against measurement standards (e.g., overlaying rulers and compasses), and infer from contextual information (e.g., overlaying scene graphs). Building on these tools, the new detection chain-of-thought can automatically decompose the task into simple subtasks, diagnose the predictions, and plan progressive box refinements. The effectiveness of our framework is demonstrated across a spectrum of detection tasks, especially on hard cases. Compared to existing state-of-the-art methods, GPT-4V with our DetToolChain improves over state-of-the-art object detectors by +21.5% AP50 on the MS COCO novel-class set for open-vocabulary detection, +24.23% accuracy on the RefCOCO val set for zero-shot referring expression comprehension, and +14.5% AP on the D-cube described object detection FULL setting.
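The progressive box-refinement loop described above can be sketched as a simple iterate-and-diagnose procedure. The code below is a minimal illustration under assumed simplifications, not the paper's implementation: the `predict` callable here is a stand-in for one chain-of-thought round, where in practice a visually prompted image (ruler overlay, zoomed crop) would be sent to an MLLM such as GPT-4V.

```python
def refine_box(predict, initial_box, max_rounds=10, tol=1):
    """Iterate box predictions until they stabilize within `tol` pixels.

    `predict` maps the current (x1, y1, x2, y2) estimate to a refined one;
    stopping when consecutive predictions agree plays the role of the
    "diagnose" step in the detection chain-of-thought.
    """
    box = initial_box
    for _ in range(max_rounds):
        new_box = predict(box)  # one prompting round on the current estimate
        converged = all(abs(a - b) <= tol for a, b in zip(new_box, box))
        box = new_box
        if converged:
            break  # diagnosis: prediction has stabilized
    return box

# Stand-in predictor: moves each coordinate halfway toward a fixed target,
# mimicking how each round of visual prompting sharpens the estimate.
target = (40, 30, 120, 100)
predict = lambda box: tuple((a + b) // 2 for a, b in zip(box, target))

print(refine_box(predict, (0, 0, 200, 200)))
```

With this toy predictor the loop converges to within a pixel of the target in a handful of rounds, illustrating why a small, fixed budget of refinement rounds can suffice.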