Text-to-image diffusion techniques have shown an exceptional capability to produce high-quality images from text descriptions, which indicates a strong correlation between the visual and textual domains. In addition, text-image discriminative models such as CLIP excel at labelling images from text prompts, thanks to the rich and diverse information available from open concepts. In this paper, we leverage these advances to address a challenging problem in computer vision: camouflaged instance segmentation. Specifically, we propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary capabilities, to learn multi-scale textual-visual features for camouflaged object representations. Such cross-domain representations are desirable for segmenting camouflaged objects, where the visual cues that distinguish objects from the background are subtle, and especially for segmenting novel objects not seen during training. We also develop supporting components that effectively fuse cross-domain features and direct relevant features towards their respective foreground objects. We validate our method and compare it with existing ones on several benchmark datasets for camouflaged instance segmentation and generic open-vocabulary instance segmentation. Experimental results confirm the advantages of our method over existing ones. We will publish our code and pre-trained models to support future research.