Zero-shot referring image segmentation is challenging: the goal is to find an instance segmentation mask that matches a given referring description, without training on paired data of this kind. Current zero-shot methods rely mainly on pre-trained discriminative models (e.g., CLIP). However, we observe that generative models (e.g., Stable Diffusion) may already capture the relationships between visual elements and text descriptions, a capability that has rarely been investigated for this task. In this work, we introduce a novel Referring Diffusional segmentor (Ref-Diff), which leverages the fine-grained multi-modal information from generative models. We demonstrate that, without a proposal generator, a generative model alone can achieve performance comparable to existing SOTA weakly-supervised models. When we combine generative and discriminative models, Ref-Diff outperforms these competing methods by a significant margin. This indicates that generative models are also beneficial for this task and can complement discriminative models for better referring segmentation. Our code is publicly available at https://github.com/kodenii/Ref-Diff.