In recent years, computer vision has advanced significantly thanks to the development of large language models (LLMs). These models enable more effective and sophisticated interaction between humans and machines, paving the way for techniques that blur the line between human and machine intelligence. In this paper, we introduce a new paradigm for object detection, which we call reasoning-based object detection. Unlike conventional object detection methods that rely on specific object names, our approach lets users interact with the system through natural language instructions, allowing for a higher level of interactivity. Our proposed method, DetGPT, leverages state-of-the-art multi-modal models and open-vocabulary object detectors to reason over the user's instruction and the visual scene. This enables DetGPT to automatically locate the object of interest based on the user's expressed desire, even if the object is not explicitly mentioned. For instance, if a user expresses a desire for a cold beverage, DetGPT can analyze the image, identify a fridge, and use its knowledge of typical fridge contents to locate the beverage. This flexibility makes our system applicable across a wide range of fields, from robotics and automation to autonomous driving. Overall, the proposed paradigm and DetGPT demonstrate the potential for more sophisticated and intuitive interaction between humans and machines. We hope that they will inspire the community and open the door to more interactive and versatile object detection systems. Our project page is available at detgpt.github.io.
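The two-stage flow described in the abstract (a multi-modal model reasons about the instruction and the scene, then an open-vocabulary detector localizes the objects it names) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not DetGPT's actual code or API: the names `toy_reasoner`, `toy_detector`, and `reasoning_based_detect` are hypothetical, the "image" is a toy dictionary of labeled boxes, and the reasoning step is a hard-coded lookup standing in for a real multi-modal model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Scene = Dict[str, List[Box]]             # toy stand-in for an image: label -> boxes


@dataclass
class Detection:
    label: str
    box: Box


def toy_reasoner(instruction: str, scene: Scene) -> List[str]:
    """Hypothetical stand-in for the multi-modal model.

    Maps a free-form instruction to names of objects that are visible in
    the scene and relevant to the request. A real system would use a
    learned model; this lookup table is an assumption for illustration.
    """
    knowledge = {
        "cold beverage": ["fridge", "bottle"],
        "sit down": ["chair", "sofa"],
    }
    text = instruction.lower()
    targets: List[str] = []
    for phrase, names in knowledge.items():
        if phrase in text:
            # Keep only the candidate objects that appear in the scene.
            targets.extend(name for name in names if name in scene)
    return targets


def toy_detector(scene: Scene, names: List[str]) -> List[Detection]:
    """Hypothetical stand-in for an open-vocabulary detector.

    Returns the boxes for each requested name; here it simply reads
    them from the toy scene.
    """
    return [Detection(name, box) for name in names for box in scene.get(name, [])]


def reasoning_based_detect(instruction: str, scene: Scene) -> List[Detection]:
    # Stage 1: reason about which objects satisfy the instruction.
    names = toy_reasoner(instruction, scene)
    # Stage 2: localize those objects.
    return toy_detector(scene, names)


scene = {
    "fridge": [(10.0, 20.0, 110.0, 220.0)],
    "chair": [(150.0, 100.0, 200.0, 180.0)],
}
result = reasoning_based_detect("I want a cold beverage", scene)
```

In this toy run the instruction never mentions a fridge, yet the fridge is the object returned, which mirrors the cold-beverage example in the abstract. Separating the reasoning step from the detection step means either stub could be swapped for a real model without changing the other.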