Replicating the innate human ability to detect all objects described by free-form text at any granularity remains a formidable challenge for vision-language models. Current Large Vision Language Models (LVLMs) are predominantly constrained to grounding a single, pre-existing object, relying solely on data from Referring Expression Comprehension tasks. This limitation forces a compromise in model design, requiring either visual expert models or customized head structures. Moving beyond these constraints, our research delves into the untapped potential of LVLMs and uncovers their inherent capability for basic object perception, which allows them to accurately identify and locate objects of interest. Building on this insight, we introduce a novel language-prompted localization dataset designed to fully unleash the capabilities of LVLMs in integrating fine-grained object perception with precise location awareness. More importantly, we present $\textbf{Griffon}$, a purely LVLM-based baseline that does not require any special tokens, expert models, or additional detection modules. It keeps a structure consistent with popular LVLMs by unifying data formats across various localization-related scenarios, and it is trained end-to-end through a well-designed pipeline. Comprehensive experiments demonstrate that $\textbf{Griffon}$ not only achieves state-of-the-art performance on the fine-grained RefCOCO series but also approaches the capabilities of the expert model Faster R-CNN on the detection benchmark MS COCO.