Zero-shot referring image segmentation is challenging because it aims to find an instance segmentation mask that matches a given referring description without training on paired data of this type. Current zero-shot methods mainly rely on pre-trained discriminative models (e.g., CLIP). However, we observe that generative models (e.g., Stable Diffusion) may already capture the relationships between visual elements and text descriptions, a capability that has rarely been investigated for this task. In this work, we introduce a novel Referring Diffusional segmentor (Ref-Diff) for this task, which leverages the fine-grained multi-modal information from generative models. We demonstrate that, without a proposal generator, a generative model alone can achieve performance comparable to existing state-of-the-art weakly-supervised models. When we combine generative and discriminative models, Ref-Diff outperforms these competing methods by a significant margin. This indicates that generative models are also beneficial for this task and can complement discriminative models for better referring segmentation. Our code is publicly available at https://github.com/kodenii/Ref-Diff.