The goal of this paper is to extract visual-language correspondence from a pre-trained text-to-image diffusion model in the form of segmentation maps, i.e., to simultaneously generate images and segmentation masks for the visual entities described in the text prompt. We make the following contributions: (i) we pair the existing Stable Diffusion model with a novel grounding module that can be trained to align the visual and textual embedding spaces of the diffusion model using only a small number of object categories; (ii) we establish an automatic pipeline for constructing a dataset of {image, segmentation mask, text prompt} triplets to train the proposed grounding module; (iii) we evaluate open-vocabulary grounding on images generated by the text-to-image diffusion model and show that the module can segment objects from categories beyond those seen at training time; (iv) we use the augmented diffusion model to build a synthetic semantic segmentation dataset and show that a standard segmentation model trained on this dataset achieves competitive performance on the zero-shot segmentation (ZS3) benchmark, which opens up new opportunities for applying powerful diffusion models to discriminative tasks.
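To make the {image, segmentation mask, text prompt} triplet concrete, the following is a minimal sketch of how one such training sample might be represented and how a predicted mask could be scored against a reference mask. The names `Triplet` and `mask_iou` are hypothetical, and the use of intersection-over-union is an assumption based on common segmentation practice; the abstract itself does not name a metric or a data layout.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical type alias: a binary mask stored as rows of booleans.
Mask = List[List[bool]]


@dataclass
class Triplet:
    """One illustrative {image, segmentation mask, text prompt} sample."""
    image_path: str  # location of the generated image (assumed field)
    mask: Mask       # binary mask for the entity named in the prompt
    prompt: str      # text prompt used to generate the image


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union between two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += int(p and t)
            union += int(p or t)
    # Two empty masks agree perfectly; this also avoids dividing by zero.
    return 1.0 if union == 0 else intersection / union


sample = Triplet(
    image_path="generated/cat_0001.png",  # placeholder path
    mask=[[True, False], [False, False]],
    prompt="a photograph of a cat",
)
predicted = [[True, True], [False, False]]
score = mask_iou(predicted, sample.mask)
```

Averaging such per-mask scores separately over categories seen and unseen during training would be one way to quantify the open-vocabulary behaviour described in contribution (iii).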