We study fine-grained referring image segmentation via a decoupled reason-then-segment pipeline. A vision-language model (VLM) receives an image and a natural-language query, reasons about the scene, and emits structured spatial prompts: a bounding box plus two interior keypoints for every referred instance. A frozen promptable segmenter (SAM 2) converts these prompts into high-quality masks. Within our GenSeg-R1 framework we finetune Qwen3-VL models (4B and 8B parameters) with Group Relative Policy Optimization (GRPO), requiring no supervised reasoning-chain annotations. On the RefCOCOg validation set our best model (GenSeg-R1-8B) achieves 71.27% cIoU and 73.82% mIoU, substantially outperforming the corresponding Qwen3-VL Instruct baselines (+15.3 and +21.9 points, respectively) and surpassing Seg-Zero-7B [3] by +3.3 cIoU under identical evaluation. We further introduce GenSeg-R1-G, a variant trained on GRefCOCO [9] with a SAM 2 in-the-loop reward that directly optimizes mask quality. On the GRefCOCO validation set GenSeg-R1-G achieves 76.69% target mIoU and 82.40% accuracy on negative (no-target) prompts, substantially outperforming Seg-R1-7B and Seg-Zero-7B, which lack no-target detection. On the ReasonSeg test set, GenSeg-R1-4B reaches 68.40% mIoU, surpassing Seg-Zero-7B by +7.0 points and Seg-R1-7B by +10.7.
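GRPO scores a group of sampled responses to the same prompt and normalizes each response's reward against the group's own mean and standard deviation, so no learned value model is needed. A minimal sketch of that normalization step, assuming scalar per-rollout rewards (e.g., an IoU-based score, as in the SAM 2 in-the-loop reward); the function name is illustrative and not taken from the paper's code:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each sampled response's reward
    by the mean and (population) std of its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four rollouts for one query, scored by mask IoU against the
# ground truth. Above-average rollouts get positive advantages.
rewards = [0.9, 0.7, 0.4, 0.2]
advs = group_relative_advantages(rewards)
```

Because the group-relative advantages are mean-centered, they sum to (approximately) zero, and the best rollout in the group receives the largest positive advantage.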