UGround: Towards Unified Visual Grounding with Unrolled Transformers

We present UGround, a \textbf{U}nified visual \textbf{Ground}ing paradigm that dynamically selects intermediate layers across \textbf{U}nrolled transformers as ``mask as prompt'', diverging from the prevailing pipeline that uses the fixed last hidden layer as ``\texttt{<SEG>} as prompt''. UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which accumulates errors propagated layer by layer without any intermediate correction, and (2) its use of \texttt{<SEG>} as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (\eg, coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each \texttt{<SEG>} token to slide across unrolled transformer layers, dynamically selecting the layer at which it connects to the vision model (\eg, SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the \texttt{<SEG>} token and the image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we unify, for the first time, visual grounding within a single framework from an attribute perspective, spanning traditional referring expression segmentation to the newly proposed reasoning segmentation, single-target to multi-target, and positive query to false premise (empty target). All code and models are publicly available at \href{https://github.com/rui-qian/UGround}{https://github.com/rui-qian/UGround}.
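The two components of Policy-Prompted Masking can be illustrated with a toy sketch. This is not the released implementation: the categorical policy over layers, the dot-product similarity, and the names `sample_layer` and `similarity_mask` are assumptions made for illustration only. The abstract does not specify the reward, the policy parameterization, or how the mask is passed to SAM, so those parts are left out.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_layer(policy_logits, rng):
    """SSC-style step (assumed form): sample one layer index from a
    categorical policy and return it with its log-probability, which a
    policy-gradient update would need."""
    probs = softmax(policy_logits)
    layer = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return layer, math.log(probs[layer])


def similarity_mask(seg_vec, image_tokens, height, width):
    """MasP-style step (assumed form): dot-product similarity between the
    <SEG> embedding and every image token, reshaped to a height x width
    grid of soft logits."""
    assert len(image_tokens) == height * width
    scores = [sum(s * t for s, t in zip(seg_vec, tok)) for tok in image_tokens]
    return [scores[r * width:(r + 1) * width] for r in range(height)]


if __name__ == "__main__":
    rng = random.Random(0)
    num_layers, dim, height, width = 4, 3, 2, 2
    # Hypothetical per-layer hidden states: one <SEG> vector and H*W image tokens.
    hidden = [
        {
            "seg": [rng.uniform(-1, 1) for _ in range(dim)],
            "image": [[rng.uniform(-1, 1) for _ in range(dim)]
                      for _ in range(height * width)],
        }
        for _ in range(num_layers)
    ]
    layer, log_prob = sample_layer([0.0] * num_layers, rng)
    mask = similarity_mask(hidden[layer]["seg"], hidden[layer]["image"], height, width)
    print("selected layer:", layer, "log-prob:", log_prob)
    print("soft logit mask:", mask)
```

In this sketch, high-valued cells of the grid play the role of the activation regions that give the segmentation model an explicit spatial cue, in contrast to passing a single \texttt{<SEG>} embedding from the last layer.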
