This paper explores the weakly-supervised referring image segmentation (WRIS) problem, and focuses on a challenging setup where target localization is learned directly from image-text pairs. We note that the input text description typically already contains detailed information on how to localize the target object, and we also observe that humans often follow a step-by-step comprehension process (\ie, progressively utilizing target-related attributes and relations as cues) to identify the target object. Hence, we propose a novel Progressive Comprehension Network (PCNet) that leverages target-related textual cues from the input description to progressively localize the target object. Specifically, we first use a Large Language Model (LLM) to decompose the input text description into short phrases. These short phrases are taken as target-related cues and fed into a Conditional Referring Module (CRM) over multiple stages, allowing the referring text embedding to be updated and the response map for target localization to be enhanced stage by stage. Based on the CRM, we then propose a Region-aware Shrinking (RaS) loss that constrains the visual localization to proceed progressively in a coarse-to-fine manner across stages. Finally, we introduce an Instance-aware Disambiguation (IaD) loss that suppresses instance localization ambiguity by differentiating overlapping response maps generated by different referring texts on the same image. Extensive experiments show that our method outperforms state-of-the-art (SOTA) methods on three common benchmarks.
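The progressive-comprehension idea above can be illustrated with a minimal sketch. This is a hypothetical toy, not the authors' PCNet implementation: each phrase cue blends into the referring embedding, the response map is recomputed, and multiplicative fusion keeps only regions that stay active across stages, loosely mimicking the coarse-to-fine shrinking constraint. All function names and the blending weight `alpha` are assumptions for illustration.

```python
def update_embedding(text_emb, cue_emb, alpha=0.5):
    """Blend the current referring embedding with a phrase-cue embedding."""
    return [(1 - alpha) * t + alpha * c for t, c in zip(text_emb, cue_emb)]

def response_map(image_feats, text_emb):
    """Dot-product similarity between each spatial feature and the text embedding."""
    return [sum(f * t for f, t in zip(feat, text_emb)) for feat in image_feats]

def progressive_localize(image_feats, text_emb, cue_embs):
    """Refine the response map over multiple stages, one phrase cue per stage."""
    resp = response_map(image_feats, text_emb)
    for cue in cue_embs:
        text_emb = update_embedding(text_emb, cue)
        new_resp = response_map(image_feats, text_emb)
        # Multiplicative fusion suppresses regions that lose support at any
        # stage, a crude analogue of the coarse-to-fine shrinking constraint.
        resp = [r * max(n, 0.0) for r, n in zip(resp, new_resp)]
    return resp

# Toy example: 4 spatial positions with 2-d features; two phrase cues.
feats = [[1.0, 0.0], [0.8, 0.2], [0.1, 0.9], [0.0, 1.0]]
resp = progressive_localize(feats, [1.0, 0.0], [[0.9, 0.1], [1.0, 0.0]])
print(max(range(len(resp)), key=lambda i: resp[i]))  # index of strongest response
```

Here the first spatial position aligns best with both the initial embedding and the cues, so its response survives every stage while the others shrink toward zero.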