Referring image segmentation (RIS) aims to precisely segment referents in images according to corresponding natural language expressions, but it relies on cost-intensive mask annotations. Weakly supervised RIS therefore learns pixel-level semantics from image-text pairs, which makes segmenting fine-grained masks challenging. A natural way to enhance segmentation precision is to empower weakly supervised RIS with the image segmentation foundation model SAM. Nevertheless, we observe that simply integrating SAM yields limited benefits and can even cause performance regression, owing to inevitable noise and SAM's tendency to focus excessively on object parts. In this paper, we present an innovative framework, Point PrompTing (PPT), together with a proposed multi-source curriculum learning strategy to address these challenges. Specifically, the core of PPT is a point generator that not only harnesses CLIP's text-image alignment capability and SAM's powerful mask generation ability, but also generates negative point prompts to inherently and effectively mitigate the noise and excessive-focus issues. In addition, we introduce a curriculum learning strategy with object-centric images that helps PPT progress gradually from simpler yet precise semantic alignment to more complex RIS. Experiments demonstrate that our PPT significantly and consistently outperforms prior weakly supervised techniques, improving mIoU by 11.34%, 14.14%, and 6.97% on RefCOCO, RefCOCO+, and G-Ref, respectively.