Referring Expression Segmentation (RES) is a core vision-language segmentation task that enables pixel-level understanding of targets via free-form linguistic expressions, supporting critical applications such as human-robot interaction and augmented reality. Despite the progress of Multimodal Large Language Model (MLLM)-based approaches, existing RES methods still suffer from two key limitations: first, the coarse bounding boxes from MLLMs lead to redundant or non-discriminative point prompts; second, the prevalent reliance on textual coordinate reasoning is unreliable, as it fails to distinguish targets from visually similar distractors. To address these issues, we propose \textbf{\model}, a novel RES framework integrating \textbf{E}ntropy-\textbf{B}ased Point \textbf{D}iscovery (\textbf{EBD}) and \textbf{V}ision-\textbf{B}ased \textbf{R}easoning (\textbf{VBR}). Specifically, EBD identifies high-information candidate points by modeling spatial uncertainty within coarse bounding boxes, treating point selection as an information maximization process. VBR verifies point correctness through joint visual-semantic alignment, abandoning text-only coordinate inference for more robust validation. Built on these components, \model implements a coarse-to-fine workflow: bounding box initialization, entropy-guided point discovery, vision-based validation, and mask decoding. Extensive evaluations on four benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg, and ReasonSeg) demonstrate that \model achieves new state-of-the-art performance across all four benchmarks, highlighting its effectiveness in generating accurate and semantically grounded segmentation masks with minimal prompts.
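The abstract describes EBD as treating point selection inside a coarse box as an information-maximization process. The exact formulation is not given here, so the following is only a minimal sketch under an assumed simplification: per-pixel binary Shannon entropy over a foreground-probability map, with the most uncertain pixels inside the box taken as candidate point prompts. The function name `entropy_point_discovery` and the `prob_map`/`bbox` inputs are hypothetical illustrations, not the paper's actual interface.

```python
import numpy as np

def entropy_point_discovery(prob_map, bbox, k=3):
    """Illustrative sketch (not the paper's EBD) of entropy-guided point picking.

    prob_map: HxW array of assumed foreground probabilities in [0, 1].
    bbox: (x0, y0, x1, y1) coarse box, e.g. from an MLLM.
    Returns up to k (x, y) points with maximal binary entropy, i.e. the
    most uncertain, information-rich locations inside the box.
    """
    x0, y0, x1, y1 = bbox
    # Clip to avoid log(0) at confidently fore-/background pixels.
    p = np.clip(prob_map[y0:y1, x0:x1], 1e-6, 1 - 1e-6)
    # Binary Shannon entropy per pixel: H(p) = -p log p - (1-p) log(1-p).
    ent = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    # Take the k most uncertain pixels and map them back to image coordinates.
    flat = np.argsort(ent.ravel())[::-1][:k]
    ys, xs = np.unravel_index(flat, ent.shape)
    return [(int(x + x0), int(y + y0)) for x, y in zip(xs, ys)]
```

Under this simplification, pixels near probability 0.5 carry the most information for a point prompt, while near-certain pixels are skipped, which matches the stated goal of avoiding redundant or non-discriminative prompts.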