Object detection has long been dominated by coordinate regression-based models such as YOLO, DETR, and Grounding DINO. Although recent efforts have attempted to leverage MLLMs for this task, they face challenges such as low recall rates, duplicate predictions, and coordinate misalignment. In this work, we bridge this gap and propose Rex-Omni, a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks such as COCO and LVIS, Rex-Omni attains performance comparable to or exceeding that of regression-based models (e.g., DINO, Grounding DINO) in a zero-shot setting. This is enabled by three key designs: 1) Task Formulation: we use special tokens to represent quantized coordinates from 0 to 999, reducing the model's learning difficulty and improving token efficiency for coordinate prediction; 2) Data Engines: we construct multiple data engines to generate high-quality grounding, referring, and pointing data, providing semantically rich supervision for training; 3) Training Pipelines: we employ a two-stage training process, combining supervised fine-tuning on 22 million samples with GRPO-based reinforcement post-training. This RL post-training leverages geometry-aware rewards to effectively bridge the discrete-to-continuous coordinate prediction gap, improve box accuracy, and mitigate undesirable behaviors such as duplicate predictions that stem from the teacher-guided nature of the initial SFT stage. Beyond conventional detection, Rex-Omni's inherent language understanding enables versatile capabilities such as object referring, pointing, visual prompting, GUI grounding, spatial referring, OCR, and key-pointing, all systematically evaluated on dedicated benchmarks. We believe Rex-Omni paves the way for more versatile and language-aware visual perception systems.
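The quantized-coordinate task formulation can be sketched as follows. The 1000-bin count follows the 0–999 range stated in the abstract, but the rounding scheme, the bin-center dequantization, and the `<q>` token spelling are illustrative assumptions, not the paper's exact format:

```python
def quantize_box(box, img_w, img_h, n_bins=1000):
    """Map a pixel-space box (x0, y0, x1, y1) onto discrete bins 0..n_bins-1.

    x-coordinates are normalized by image width, y-coordinates by height.
    """
    return [
        min(int(v / size * n_bins), n_bins - 1)
        for v, size in zip(box, (img_w, img_h, img_w, img_h))
    ]

def dequantize_box(qbox, img_w, img_h, n_bins=1000):
    """Recover approximate pixel coordinates from bin centers."""
    return [
        (q + 0.5) / n_bins * size
        for q, size in zip(qbox, (img_w, img_h, img_w, img_h))
    ]

def box_to_tokens(qbox):
    """Render a quantized box as special tokens: one token per coordinate,
    so a full box costs exactly four tokens (token spelling is hypothetical)."""
    return "".join(f"<{q}>" for q in qbox)

# Example: a (50, 30, 320, 240) box in a 640x480 image
tokens = box_to_tokens(quantize_box((50, 30, 320, 240), 640, 480))
# -> "<78><62><500><500>"
```

One token per quantized coordinate is what makes the scheme token-efficient: a box is always four tokens, instead of the variable-length digit strings a plain-text number would produce.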