While existing large vision-language multimodal models focus on whole-image understanding, a prominent gap remains in region-specific comprehension. Current approaches that rely on textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual prompts. Users can thus mark images intuitively and interact with the model using natural cues such as a "red bounding box" or a "pointed arrow". Our simple design overlays visual markers directly onto the RGB image, eliminating the need for complex region encodings, yet achieves state-of-the-art performance on region-understanding tasks such as Visual7W, PointQA, and the Visual Commonsense Reasoning benchmark. Furthermore, we present ViP-Bench, a comprehensive benchmark for assessing how well models understand visual prompts across multiple dimensions, enabling future research in this domain. Code, data, and model are publicly available.
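To make the overlay idea concrete, the following is a minimal sketch of how a "red bounding box" prompt could be drawn directly onto the pixels of an RGB image. It is an illustration under stated assumptions, not the released implementation: the function name `overlay_box_prompt`, its parameters, and the choice of NumPy are hypothetical, and only a rectangular marker is covered (arrows, scribbles, and other prompt shapes are not).

```python
import numpy as np


def overlay_box_prompt(image, box, color=(255, 0, 0), thickness=2, alpha=1.0):
    """Return a copy of `image` with a rectangular outline blended onto it.

    Hypothetical helper for illustration only.
    image: uint8 array of shape (H, W, 3) in RGB order.
    box:   (x0, y0, x1, y1) in pixels; x1 and y1 are exclusive.
    alpha: 1.0 draws an opaque marker, smaller values blend it with the image.
    """
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("expected an RGB image of shape (H, W, 3)")
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box

    # Clip the box to the image bounds.
    x0, x1 = max(0, x0), min(w, x1)
    y0, y1 = max(0, y0), min(h, y1)
    if x0 >= x1 or y0 >= y1:
        raise ValueError("box does not overlap the image")

    # Boolean mask that is True only on the outline of the box.
    mask = np.zeros((h, w), dtype=bool)
    mask[y0:y1, x0:x1] = True
    ix0, ix1 = x0 + thickness, x1 - thickness
    iy0, iy1 = y0 + thickness, y1 - thickness
    if ix0 < ix1 and iy0 < iy1:
        mask[iy0:iy1, ix0:ix1] = False  # hollow out the interior

    # Blend the marker color into the masked pixels; the input is not modified.
    out = image.astype(np.float32)
    out[mask] = (1.0 - alpha) * out[mask] + alpha * np.asarray(color, dtype=np.float32)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


if __name__ == "__main__":
    blank = np.zeros((8, 8, 3), dtype=np.uint8)
    prompted = overlay_box_prompt(blank, (2, 2, 6, 6), thickness=1)
    print(prompted[..., 0])  # red channel: the outline pixels are 255
```

Because the marker lives in pixel space, the prompted image has the same shape and type as the original and can be passed to an image encoder unchanged, which is the property the abstract contrasts with dedicated region encodings.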