Reasoning segmentation requires models to ground complex, implicit textual queries into precise pixel-level masks. Existing approaches rely on a single segmentation token $\texttt{<SEG>}$, whose hidden state implicitly encodes both semantic reasoning and spatial localization, limiting the model's ability to explicitly disentangle what to segment from where to segment. We introduce AnchorSeg, which reformulates reasoning segmentation as a structured conditional generation process over image tokens, conditioned on language-grounded query banks. Instead of compressing all semantic reasoning and spatial localization into a single embedding, AnchorSeg constructs an ordered sequence of query banks: latent reasoning tokens that capture intermediate semantic states, and a segmentation anchor token that provides explicit spatial grounding. We model spatial conditioning as a factorized distribution over image tokens, where the anchor query determines localization signals while contextual queries provide semantic modulation. To bridge token-level predictions and pixel-level supervision, we propose Token--Mask Cycle Consistency (TMCC), a bidirectional training objective that enforces alignment across resolutions. By explicitly decoupling spatial grounding from semantic reasoning through structured language-grounded query banks, AnchorSeg achieves state-of-the-art results on the ReasonSeg test set (67.7\% gIoU and 68.1\% cIoU). All code and models are publicly available at https://github.com/rui-qian/AnchorSeg.
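
The abstract reports gIoU and cIoU without defining them. As a minimal sketch, assuming the definitions commonly used in the reasoning-segmentation literature (gIoU as the mean of per-image IoUs, cIoU as cumulative intersection over cumulative union), the two metrics can be computed as below. The function names and the handling of empty masks are illustrative assumptions and are not taken from the AnchorSeg code base.

```python
from typing import List, Sequence, Tuple

# A binary mask given as rows of 0/1 values (hypothetical representation).
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and in the union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def giou_and_ciou(preds: List[Mask], targets: List[Mask]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) over a set of predicted and ground-truth masks."""
    per_image_iou = []
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        # Assumption: when both masks are empty, the image counts as IoU 1.0.
        per_image_iou.append(inter / union if union > 0 else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image_iou) / len(per_image_iou)  # mean of per-image IoUs
    ciou = total_inter / total_union if total_union > 0 else 1.0  # pooled IoU
    return giou, ciou


# Two toy 2x2 images: the first prediction over-segments, the second is exact.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
giou, ciou = giou_and_ciou(preds, targets)
print(f"gIoU={giou:.4f} cIoU={ciou:.4f}")
```

Because cIoU pools pixels across images, it is weighted toward large objects, whereas gIoU treats every image equally; this is why the two reported numbers can differ.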