Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image. Collecting labelled datasets for this task, however, is notoriously costly and labor-intensive. To overcome this issue, we propose a simple yet effective zero-shot referring image segmentation method that leverages the pre-trained cross-modal knowledge of CLIP. To obtain segmentation masks grounded to the input text, we propose a mask-guided visual encoder that captures global and local contextual information of an input image. By utilizing instance masks obtained from off-the-shelf mask proposal techniques, our method is able to segment fine-detailed instance-level groundings. We also introduce a global-local text encoder in which the global feature captures complex sentence-level semantics of the entire input expression, while the local feature focuses on the target noun phrase extracted by a dependency parser. In our experiments, the proposed method outperforms several zero-shot baselines for the task, and even a weakly supervised referring expression segmentation method, by substantial margins. Our code is available at https://github.com/Seonghoon-Yu/Zero-shot-RIS.
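The selection step implied by the abstract (score each mask proposal against the referring expression and keep the best match) can be illustrated with a minimal sketch. This is not the released implementation: the blending weights `alpha` and `beta`, the weighted-sum fusion of global and local features, and the function names `blend` and `select_mask` are all assumptions made for illustration, and the feature vectors stand in for embeddings that CLIP would produce.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def blend(global_feat, local_feat, weight):
    """Hypothetical fusion: weighted sum of unit-normalized global and local features."""
    g = normalize(global_feat)
    l = normalize(local_feat)
    return normalize([weight * a + (1.0 - weight) * b for a, b in zip(g, l)])


def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))


def select_mask(mask_feats, text_global, text_local, alpha=0.5, beta=0.5):
    """Return the index of the best-matching mask proposal and all scores.

    mask_feats: list of (global_visual_feature, local_visual_feature) pairs,
                one pair per mask proposal (placeholder vectors here).
    alpha, beta: assumed blending weights for the visual and text sides.
    """
    text = blend(text_global, text_local, beta)
    scores = [cosine(blend(g, l, alpha), text) for g, l in mask_feats]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

In this sketch the mask whose fused visual feature has the highest cosine similarity to the fused text feature is returned as the prediction; how the actual method combines global and local features is not specified in the abstract.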