Collecting and annotating images with pixel-wise labels is time-consuming and laborious. In contrast, synthetic data can be obtained freely from a generative model (e.g., DALL-E, Stable Diffusion). In this paper, we show that it is possible to automatically obtain accurate semantic masks for synthetic images generated by the off-the-shelf Stable Diffusion model, which uses only text-image pairs during training. Our approach, called DiffuMask, exploits the potential of the cross-attention map between text and image, which makes it natural and seamless to extend text-driven image synthesis to semantic mask generation. DiffuMask uses text-guided cross-attention information to localize class/word-specific regions, which are combined with practical techniques to create a novel high-resolution and class-discriminative pixel-wise mask. These methods substantially reduce data collection and annotation costs. Experiments demonstrate that existing segmentation methods trained on DiffuMask's synthetic data achieve performance competitive with their counterparts trained on real data (VOC 2012, Cityscapes). For some classes (e.g., bird), DiffuMask shows promising performance, close to the state-of-the-art result obtained with real data (within a 3% mIoU gap). Moreover, in the open-vocabulary (zero-shot) segmentation setting, DiffuMask achieves a new SOTA result on the unseen classes of VOC 2012. The project website can be found at https://weijiawu.github.io/DiffusionMask/.
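To make the idea of localizing a class-specific region from cross-attention concrete, the following is a minimal sketch and not DiffuMask's actual pipeline. It assumes the cross-attention maps for one class token have already been extracted from the diffusion model and resized to a common resolution, and it substitutes a simple fixed threshold for the "practical techniques" the abstract mentions. The function names and the threshold value are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def average_maps(maps: List[Grid]) -> Grid:
    """Average several same-sized attention maps element-wise."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(m[r][c] for m in maps) / len(maps) for c in range(cols)]
        for r in range(rows)
    ]


def normalize(grid: Grid) -> Grid:
    """Min-max scale values to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in grid]
    return [[(v - lo) / (hi - lo) for v in row] for row in grid]


def attention_to_mask(maps: List[Grid], threshold: float = 0.5) -> List[List[int]]:
    """Turn per-layer attention maps for one class token into a binary mask.

    Assumption: a fixed threshold stands in for the refinement steps
    that the abstract does not spell out.
    """
    avg = normalize(average_maps(maps))
    return [[1 if v >= threshold else 0 for v in row] for row in avg]


# Two toy 2x2 attention maps for a hypothetical "bird" token.
toy_maps = [
    [[0.1, 0.2], [0.8, 0.9]],
    [[0.0, 0.2], [0.6, 1.0]],
]
print(attention_to_mask(toy_maps))
```

In this toy input the bottom row receives most of the attention, so only those pixels are marked as belonging to the class. A real pipeline would operate on much larger maps and would need a more careful way of choosing the foreground/background boundary.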