We present UGround, a \textbf{U}nified visual \textbf{Ground}ing paradigm that dynamically selects intermediate layers across \textbf{U}nrolled transformers as ``mask as prompt'', diverging from the prevailing pipeline that leverages the fixed last hidden layer as ``\texttt{<SEG>} as prompt''. UGround addresses two primary challenges of the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors from layer-by-layer propagation without intermediate correction, and (2) its use of \texttt{<SEG>} as a prompt, which implicitly projects textual embeddings into the visual space without explicit spatial cues (\eg, coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each \texttt{<SEG>} token to slide across the unrolled transformer layers, dynamically selecting the layer at which it connects to the vision model (\eg, SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map between the \texttt{<SEG>} token and the image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we unify, for the first time, visual grounding within a single framework from an attribute perspective, spanning from traditional referring expression segmentation to the newly proposed reasoning segmentation, from single-target to multi-target, and from positive queries to false premises (empty targets). All code and models are publicly available at \href{https://github.com/rui-qian/UGround}{https://github.com/rui-qian/UGround}.
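The MasP step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name, the use of cosine similarity, and the temperature parameter are assumptions for illustration; the key idea is that the \texttt{<SEG>} hidden state from the selected layer is compared against every image token, and the resulting similarity map is reshaped into a spatial grid of soft logits that can prompt a mask decoder such as SAM.

```python
import numpy as np

def mask_as_prompt(seg_token, image_tokens, h, w, tau=1.0):
    """Sketch of Mask as Prompt (hypothetical names/details).

    seg_token:    (d,) hidden state of the <SEG> token at the selected layer
    image_tokens: (h*w, d) patch-level image token embeddings
    Returns an (h, w) similarity map used as soft mask logits.
    """
    # Cosine similarity (an assumed choice) between <SEG> and each image token.
    seg = seg_token / (np.linalg.norm(seg_token) + 1e-6)
    img = image_tokens / (np.linalg.norm(image_tokens, axis=-1, keepdims=True) + 1e-6)
    sim = img @ seg / tau      # (h*w,) similarity scores
    # Reshape to the patch grid: high-activation regions give explicit
    # spatial cues, unlike a single projected <SEG> embedding.
    return sim.reshape(h, w)
```

In this reading, the soft logit mask plays the role that coordinate prompts play in the standard SAM interface: its activation regions tell the mask decoder *where* to segment, rather than leaving localization implicit in a projected text embedding.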