In this paper, we study the challenging task of zero-shot referring image segmentation, which aims to identify the instance mask most related to a referring expression without training on pixel-level annotations. Previous research takes advantage of pre-trained cross-modal models, e.g., CLIP, to align instance-level masks with referring expressions. Yet CLIP only considers the global-level alignment of image-text pairs, neglecting fine-grained matching between the referring sentence and local image regions. To address this challenge, we introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework that is training-free and robust to various visual encoders. TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial rectifier for mask post-processing. Notably, the text-augmented visual-text matching score leverages a $P$-score and an $N$-score in addition to the typical visual-text matching score. The $P$-score closes the visual-text domain gap through a surrogate captioning model: it is computed between the texts generated by the surrogate model and the referring expression. The $N$-score considers the fine-grained alignment of region-text pairs via negative phrase mining, encouraging the masked image to be repelled from the mined distracting phrases. Extensive experiments are conducted on various datasets, including RefCOCO, RefCOCO+, and RefCOCOg. The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
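As an illustration of the matching step, the following is a minimal sketch of how the three scores might be combined to select one mask proposal. The abstract does not state the combination rule, so the additive form, the weights `alpha` and `lam`, and the function name `select_mask` are assumptions made for this example. The per-mask scores are taken as given (e.g., similarities computed elsewhere by a visual-text model), and the spatial rectifier is not modeled.

```python
from typing import Sequence


def select_mask(
    vt_scores: Sequence[float],
    p_scores: Sequence[float],
    n_scores: Sequence[float],
    alpha: float = 0.5,  # assumed weight of the P-score (hypothetical value)
    lam: float = 0.5,    # assumed weight of the N-score (hypothetical value)
) -> int:
    """Return the index of the mask proposal with the highest combined score.

    vt_scores: visual-text matching score of each masked image vs. the expression.
    p_scores:  score between a generated caption of each mask and the expression.
    n_scores:  score of each masked image vs. the mined distracting phrases.
    """
    if not (len(vt_scores) == len(p_scores) == len(n_scores)):
        raise ValueError("all score lists must have one entry per mask proposal")
    if len(vt_scores) == 0:
        raise ValueError("at least one mask proposal is required")

    # Assumed rule: reward caption agreement, penalize similarity to distractors.
    combined = [
        s + alpha * p - lam * n
        for s, p, n in zip(vt_scores, p_scores, n_scores)
    ]
    return max(range(len(combined)), key=combined.__getitem__)


if __name__ == "__main__":
    vt = [0.30, 0.32, 0.10]  # mask 1 wins on the plain matching score alone
    p = [0.60, 0.20, 0.10]   # the caption of mask 0 agrees best with the expression
    n = [0.10, 0.50, 0.05]   # mask 1 resembles the distracting phrases
    print(select_mask(vt, p, n))
```

In this toy input, the plain visual-text score alone would favor the second proposal, while the added $P$- and $N$-terms shift the choice to the first, which is the kind of correction the text-augmented score is meant to provide.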