Preparing training data for deep vision models is labor-intensive, and generative models have emerged as an effective way to produce synthetic data instead. While current generative models produce image-level category labels, we propose a novel method for generating pixel-level semantic segmentation labels using the text-to-image generative model Stable Diffusion (SD). By utilizing the text prompts, cross-attention, and self-attention of SD, we introduce three new techniques: \textit{class-prompt appending}, \textit{class-prompt cross-attention}, and \textit{self-attention exponentiation}. These techniques enable us to generate segmentation maps corresponding to synthetic images, which then serve as pseudo-labels for training semantic segmenters and eliminate the need for manual pixel-wise annotation. To account for the imperfections in our pseudo-labels, we incorporate uncertainty regions into the segmentation, allowing us to disregard loss from those regions. We conduct evaluations on two datasets, PASCAL VOC and MSCOCO, and our approach significantly outperforms concurrent work. Our benchmarks and code will be released at https://github.com/VinAIResearch/Dataset-Diffusion.
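As a rough illustration of how self-attention exponentiation and uncertainty regions might fit together, the following NumPy sketch propagates class scores through a powered self-attention matrix, thresholds them into pseudo-labels, and skips uncertain pixels in the loss. It is a minimal sketch under stated assumptions, not the released implementation: the function names, the exponent `power`, the thresholds `low` and `high`, the min-max normalization, and the use of 255 as the uncertain label are all hypothetical choices made for this example.

```python
import numpy as np

IGNORE_LABEL = 255  # assumed marker for uncertain pixels (hypothetical choice)


def refine_with_self_attention(cross_attn, self_attn, power=4):
    """Propagate class scores through a powered self-attention matrix.

    cross_attn: (num_classes, num_pixels) class-prompt cross-attention scores.
    self_attn:  (num_pixels, num_pixels) self-attention, rows sum to 1.
    power:      assumed integer exponent, not a value taken from the paper.
    """
    propagated = np.linalg.matrix_power(self_attn, power)
    # Each pixel gathers class scores from the pixels it attends to.
    refined = cross_attn @ propagated.T
    # Min-max normalize each class map to [0, 1] (an assumption of this sketch).
    lo = refined.min(axis=1, keepdims=True)
    hi = refined.max(axis=1, keepdims=True)
    return (refined - lo) / (hi - lo + 1e-8)


def pseudo_labels(scores, low=0.3, high=0.6):
    """Turn normalized scores into labels: 0 is background, 1..C are classes."""
    best = scores.argmax(axis=0)
    conf = scores.max(axis=0)
    labels = np.full(scores.shape[1], IGNORE_LABEL, dtype=np.int64)
    labels[conf >= high] = best[conf >= high] + 1  # confident foreground
    labels[conf < low] = 0                         # confident background
    return labels                                  # the rest stays uncertain


def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over pixels whose label is not IGNORE_LABEL.

    logits: (num_labels, num_pixels) raw segmenter outputs.
    """
    valid = labels != IGNORE_LABEL
    if not valid.any():
        return 0.0
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    picked = log_probs[labels[valid], np.nonzero(valid)[0]]
    return float(-picked.mean())
```

With these assumed thresholds, a pixel whose best class score falls between `low` and `high` receives `IGNORE_LABEL` and contributes nothing to the loss, so noisy regions of a pseudo-label do not penalize the segmenter.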