Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, which limits their out-of-domain generalization and leaves them without an explicit reasoning process. To address these limitations, we propose Seg-Zero, a novel framework that exhibits strong generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero introduces a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which the segmentation model then uses to generate precise pixel-level masks. We design a reward mechanism that integrates format and accuracy rewards to guide optimization. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Seg-Zero achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%. This significant improvement highlights Seg-Zero's ability to generalize across domains while presenting an explicit reasoning process.
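To make the reward design above more concrete, the following is a minimal sketch of how a format reward and an accuracy reward might be combined, and how GRPO-style group-relative advantages could be computed from the resulting scores. The output template (`<think>…</think><answer>x1,y1,x2,y2</answer>`), the use of a single bounding box as the positional prompt, the IoU threshold, the equal weighting of the two rewards, and all function names are illustrative assumptions, not the exact formulation used by Seg-Zero.

```python
import re
from statistics import mean, pstdev

# Assumed (hypothetical) output template: a reasoning chain followed by a
# bounding box that serves as the positional prompt.
_TEMPLATE = re.compile(
    r"^<think>(.+?)</think>\s*"
    r"<answer>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</answer>$",
    re.DOTALL,
)


def parse_box(text):
    """Return (x1, y1, x2, y2) if the text follows the template, else None."""
    match = _TEMPLATE.match(text.strip())
    if match is None:
        return None
    return tuple(int(g) for g in match.groups()[1:])


def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0, a[2] - a[0]) * max(0, a[3] - a[1])
    area_b = max(0, b[2] - b[0]) * max(0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def total_reward(text, gt_box, iou_threshold=0.5):
    """Format reward plus accuracy reward (weights and threshold are assumed)."""
    box = parse_box(text)
    format_reward = 1.0 if box is not None else 0.0
    accurate = box is not None and box_iou(box, gt_box) >= iou_threshold
    accuracy_reward = 1.0 if accurate else 0.0
    return format_reward + accuracy_reward


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group, as in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, a response that follows the template and localizes the target scores higher than one that only follows the template, and both score higher than free-form text. The advantage of each sampled response is then measured relative to the other responses drawn for the same query, so no separate value model is needed.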