Existing reasoning segmentation methods often fall short in complex cases, particularly when handling complicated queries and out-of-domain images. Inspired by chain-of-thought reasoning, where harder problems require more thinking steps or time, this paper explores a system that can think step by step, look up information when needed, generate results, evaluate its own results, and refine them, much as humans approach harder questions. We introduce CoT-Seg, a training-free framework that rethinks reasoning segmentation by combining chain-of-thought reasoning with self-correction. Instead of fine-tuning, CoT-Seg leverages the inherent reasoning ability of pre-trained MLLMs (GPT-4o) to decompose queries into meta-instructions, extract fine-grained semantics from images, and identify target objects even under implicit or complex prompts. Moreover, CoT-Seg incorporates a self-correction stage: the model evaluates its own segmentation against the original query and reasoning trace, identifies mismatches, and iteratively refines the mask. This tight integration of reasoning and correction significantly improves reliability and robustness, especially in ambiguous or error-prone cases. Furthermore, the CoT-Seg framework allows easy incorporation of retrieval-augmented reasoning, enabling the system to access external knowledge when the input lacks sufficient information. To showcase CoT-Seg's ability to handle very challenging cases, we introduce a new dataset, ReasonSeg-Hard. Our results highlight that combining chain-of-thought reasoning and self-correction offers a powerful paradigm for segmentation driven by vision-language integration.
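The reason, segment, self-evaluate, and refine cycle described in the abstract can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: the names `cot_seg_loop`, `Verdict`, `decompose`, `segment`, `evaluate`, and `max_rounds` are hypothetical, the MLLM and the segmentation model are replaced by toy stand-in functions, and the stopping rule (a fixed round budget) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Mask = List[List[int]]  # toy binary mask; a real system would use an array


@dataclass
class Verdict:
    ok: bool        # True if the mask is judged to match the query
    feedback: str   # hypothetical correction hint used for refinement


def cot_seg_loop(
    query: str,
    image: object,
    decompose: Callable[[str, object], List[str]],
    segment: Callable[[object, List[str]], Mask],
    evaluate: Callable[[str, List[str], Mask], Verdict],
    max_rounds: int = 3,  # assumed refinement budget
) -> Tuple[Mask, int]:
    """Return the final mask and the number of refinement rounds performed."""
    # Step 1: break the query into meta-instructions (stand-in for the MLLM).
    instructions = decompose(query, image)
    # Step 2: produce an initial mask from those instructions.
    mask = segment(image, instructions)
    for rounds_done in range(max_rounds):
        # Step 3: check the mask against the query and the reasoning trace.
        verdict = evaluate(query, instructions, mask)
        if verdict.ok:
            return mask, rounds_done
        # Step 4: fold the feedback into the instructions and re-segment.
        instructions = instructions + [verdict.feedback]
        mask = segment(image, instructions)
    # Budget exhausted: return the last refined mask without re-checking it.
    return mask, max_rounds


# Toy stand-ins so the sketch runs without any model.
def toy_decompose(query: str, image: object) -> List[str]:
    return [f"locate: {query}"]


def toy_segment(image: object, instructions: List[str]) -> Mask:
    # One more foreground pixel per instruction, in a single row of 4 pixels.
    n = len(instructions)
    return [[1 if i < n else 0 for i in range(4)]]


def toy_evaluate(query: str, instructions: List[str], mask: Mask) -> Verdict:
    if sum(mask[0]) >= 2:
        return Verdict(True, "")
    return Verdict(False, "include the missing part")


mask, rounds = cot_seg_loop(
    "the object to sit on", None, toy_decompose, toy_segment, toy_evaluate
)
```

Passing the three stages as callables keeps the loop independent of any particular model, and a retrieval step could be added inside `decompose` without changing the loop itself.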