Detect Anything via Next Point Prediction

Object detection has long been dominated by traditional coordinate regression-based models, such as YOLO, DETR, and Grounding DINO. Although recent efforts have attempted to leverage MLLMs to tackle this task, they face challenges such as low recall, duplicate predictions, and coordinate misalignment. In this work, we bridge this gap and propose Rex-Omni, a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models (e.g., DINO, Grounding DINO) in a zero-shot setting. This is enabled by three key designs: 1) Task Formulation: we use special tokens to represent quantized coordinates from 0 to 999, reducing the model's learning difficulty and improving token efficiency for coordinate prediction; 2) Data Engines: we construct multiple data engines to generate high-quality grounding, referring, and pointing data, providing semantically rich supervision for training; 3) Training Pipelines: we employ a two-stage training process, combining supervised fine-tuning on 22 million data samples with GRPO-based reinforcement post-training. This RL post-training leverages geometry-aware rewards to effectively bridge the discrete-to-continuous coordinate prediction gap, improve box accuracy, and mitigate undesirable behaviors like duplicate predictions that stem from the teacher-guided nature of the initial SFT stage. Beyond conventional detection, Rex-Omni's inherent language understanding enables versatile capabilities such as object referring, pointing, visual prompting, GUI grounding, spatial referring, OCR, and key-pointing, all systematically evaluated on dedicated benchmarks. We believe that Rex-Omni paves the way for more versatile and language-aware visual perception systems.
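The coordinate quantization in the task formulation can be illustrated with a minimal sketch. The abstract only states that special tokens represent quantized coordinates from 0 to 999; the token spelling `<123>`, the helper names `quantize`, `box_to_tokens`, and `tokens_to_box`, and the choice to decode to bin centers are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence

NUM_BINS = 1000  # the abstract states coordinates are quantized to 0..999


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    index = int(value * num_bins / size)
    # Clamp so that value == size falls into the last bin instead of bin 1000.
    return min(max(index, 0), num_bins - 1)


def box_to_tokens(box: Sequence[float], width: float, height: float) -> List[str]:
    """Encode an (x0, y0, x1, y1) pixel box as four coordinate tokens."""
    x0, y0, x1, y1 = box
    bins = [
        quantize(x0, width),
        quantize(y0, height),
        quantize(x1, width),
        quantize(y1, height),
    ]
    # Hypothetical token spelling; the real vocabulary entries may look different.
    return [f"<{b}>" for b in bins]


def tokens_to_box(tokens: Sequence[str], width: float, height: float) -> List[float]:
    """Decode four coordinate tokens back to pixel values at the bin centers."""
    bins = [int(t.strip("<>")) for t in tokens]
    sizes = [width, height, width, height]
    return [(b + 0.5) * s / NUM_BINS for b, s in zip(bins, sizes)]


if __name__ == "__main__":
    tokens = box_to_tokens((64.0, 48.0, 320.0, 240.0), width=640, height=480)
    print(tokens)
    print(tokens_to_box(tokens, width=640, height=480))
```

Under these assumptions each box costs four tokens regardless of image resolution, and decoding recovers each coordinate to within half a bin (0.32 pixels horizontally on a 640-pixel-wide image). That rounding error is one simple instance of a discrete-to-continuous gap of the kind the abstract mentions.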
