We present SimpleSeg, a strikingly simple yet highly effective approach that endows Multimodal Large Language Models (MLLMs) with native pixel-level perception. Our method reframes segmentation as a sequence generation problem: the model directly predicts sequences of points (textual coordinates) delineating object boundaries, entirely within its language space. To achieve high fidelity, we introduce a two-stage SFT$\to$RL training pipeline, in which Reinforcement Learning with an IoU-based reward refines the point sequences to accurately match ground-truth contours. We find that the standard MLLM architecture possesses a strong inherent capacity for low-level perception that can be unlocked without any specialized modules. On segmentation benchmarks, SimpleSeg achieves performance comparable to, and often surpassing, methods that rely on complex, task-specific designs. This work shows that precise spatial understanding can emerge from simple point prediction, challenging the prevailing reliance on auxiliary components and paving the way for more unified and capable MLLMs. Homepage: https://simpleseg.github.io/
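The two mechanics the abstract names, decoding a textual point sequence into a contour and scoring it against the ground truth with an IoU-based reward, can be sketched as below. This is a minimal illustration only: the coordinate format, the rasterization grid size, and all function names are assumptions for the sketch, not the authors' implementation.

```python
import re

def parse_points(text):
    """Parse a textual point sequence such as "(10,20) (30,40) ..." into (x, y) tuples.
    The "(x,y)" format is an assumed serialization, not necessarily SimpleSeg's."""
    return [(float(x), float(y)) for x, y in re.findall(r"\(([\d.]+),\s*([\d.]+)\)", text)]

def point_in_polygon(x, y, poly):
    """Ray-casting test: is the point (x, y) inside the closed polygon `poly`?"""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal ray at height y
            xi = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < xi:
                inside = not inside
    return inside

def rasterize(poly, h, w):
    """Binary mask of the polygon on an h x w grid, sampled at pixel centers."""
    return [[point_in_polygon(c + 0.5, r + 0.5, poly) for c in range(w)] for r in range(h)]

def iou_reward(pred_text, gt_poly, h=64, w=64):
    """IoU between the model's predicted point sequence and a ground-truth polygon.
    Serves as the scalar reward signal during the RL refinement stage."""
    pred = parse_points(pred_text)
    if len(pred) < 3:
        return 0.0  # degenerate prediction (fewer than 3 points) earns no reward
    pred_mask = rasterize(pred, h, w)
    gt_mask = rasterize(gt_poly, h, w)
    inter = sum(p and g for pr, gr in zip(pred_mask, gt_mask) for p, g in zip(pr, gr))
    union = sum(p or g for pr, gr in zip(pred_mask, gt_mask) for p, g in zip(pr, gr))
    return inter / union if union else 0.0
```

A prediction that exactly traces the ground-truth contour receives reward 1.0, while a disjoint or degenerate sequence receives 0.0, giving the RL stage a dense signal to sharpen the point sequences.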