Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

Replicating the innate human ability to detect all objects based on free-form texts at any granularity remains a formidable challenge for Vision-Language models. Current Large Vision Language Models (LVLMs) are predominantly constrained to grounding a single, pre-existing object, relying solely on data from Referring Expression Comprehension tasks. This limitation forces a compromise in model design, necessitating the introduction of visual expert models or the integration of customized head structures. Moving beyond these constraints, our research delves into the untapped potential of LVLMs and uncovers their inherent capability for basic object perception, which allows them to accurately identify and locate objects of interest. Building on this insight, we introduce a novel language-prompted localization dataset designed to fully unleash the capabilities of LVLMs in integrating fine-grained object perception with precise location awareness. More importantly, we present $\textbf{Griffon}$, a purely LVLM-based baseline that does not require any special tokens, expert models, or additional detection modules. It keeps a structure consistent with popular LVLMs by unifying data formats across various localization-related scenarios, and it is trained end-to-end through a well-designed pipeline. Comprehensive experiments demonstrate that $\textbf{Griffon}$ not only achieves state-of-the-art performance on the fine-grained RefCOCO series but also approaches the capabilities of the expert model Faster RCNN on the detection benchmark MSCOCO.
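As a rough illustration of what locating objects without special tokens or detection heads could look like, the sketch below writes bounding boxes out as ordinary text and parses them back. The text format (`label-[x1,y1,x2,y2]` with coordinates normalized to three decimals, items joined by `&`) and the function names `boxes_to_text` and `text_to_boxes` are assumptions made for illustration; the abstract does not specify the format Griffon actually uses.

```python
import re
from typing import List, Tuple

# (label, x1, y1, x2, y2) with coordinates in pixels
Box = Tuple[str, float, float, float, float]

# Hypothetical plain-text format: "label-[x1,y1,x2,y2]&label-[...]"
_ITEM = re.compile(r"([^&\[\]]+)-\[([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+)\]")


def boxes_to_text(boxes: List[Box], width: int, height: int) -> str:
    """Serialize boxes as plain text, normalizing coordinates to [0, 1]."""
    parts = []
    for label, x1, y1, x2, y2 in boxes:
        coords = (x1 / width, y1 / height, x2 / width, y2 / height)
        parts.append(f"{label}-[" + ",".join(f"{c:.3f}" for c in coords) + "]")
    return "&".join(parts)


def text_to_boxes(text: str, width: int, height: int) -> List[Box]:
    """Parse the plain-text format back into pixel-space boxes."""
    boxes = []
    for match in _ITEM.finditer(text):
        label = match.group(1)
        x1, y1, x2, y2 = (float(g) for g in match.groups()[1:])
        boxes.append((label, x1 * width, y1 * height, x2 * width, y2 * height))
    return boxes


if __name__ == "__main__":
    example = [("dog", 64, 48, 320, 240), ("cat", 0, 0, 640, 480)]
    encoded = boxes_to_text(example, width=640, height=480)
    print(encoded)
    print(text_to_boxes(encoded, width=640, height=480))
```

Because every coordinate is an ordinary string of digits, a language model with an unmodified vocabulary could in principle emit any number of such items in one response, which is the property the abstract emphasizes. Rounding to three decimals is a design choice of this sketch and limits the precision of the recovered boxes.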

