This technical report outlines our method for generating a synthetic dataset for semantic segmentation using a latent diffusion model. Our approach eliminates the need for additional models trained specifically on segmentation data and forms our submission to the CVPR 2024 workshop challenge "SyntaGen: Harnessing Generative Models for Synthetic Visual Datasets". Our method uses the self-attentions of Stable Diffusion to perform a novel head-wise semantic information condensation, enabling class-agnostic image segmentations to be obtained directly from the Stable Diffusion latents. We then employ text-to-pixel cross-attentions, extracted without influencing the prompt, to classify the previously generated masks. Finally, we propose a mask refinement step that uses only the image output by Stable Diffusion.
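As a rough illustration of the two attention-based steps above, the following sketch shows how pre-extracted self- and cross-attention tensors could be turned into labelled masks. This is a minimal sketch, not the report's actual implementation: the plain head averaging and k-means clustering stand in for the head-wise semantic condensation, and every name in the snippet (`masks_from_attentions`, `self_attn`, `cross_attn`, `class_tokens`) is hypothetical.

```python
# Minimal sketch, assuming attention maps have already been hooked out of the
# Stable Diffusion UNet. All names here are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def masks_from_attentions(self_attn, cross_attn, class_tokens, h, w, n_segments=8):
    """self_attn: (heads, H*W, H*W); cross_attn: (H*W, n_text_tokens);
    class_tokens: prompt positions of the class nouns, one per class."""
    # Stand-in for head-wise condensation: average the heads into a single
    # pixel-to-pixel affinity matrix.
    affinity = self_attn.mean(axis=0)                            # (H*W, H*W)
    # Cluster pixels by their attention profiles -> class-agnostic segments.
    segments = KMeans(n_clusters=n_segments, n_init=10).fit_predict(affinity)
    # Classify each segment by the class token it attends to most strongly.
    labels = np.zeros(h * w, dtype=np.int64)
    for s in range(n_segments):
        idx = segments == s
        scores = cross_attn[idx][:, class_tokens].mean(axis=0)   # (n_classes,)
        labels[idx] = scores.argmax()
    return labels.reshape(h, w)

# Toy usage with random tensors (h = w = 16 keeps the demo light; real SD
# latents are typically 64x64, with 8 attention heads and 77 CLIP text tokens).
h = w = 16
seg_map = masks_from_attentions(
    self_attn=np.random.rand(8, h * w, h * w),
    cross_attn=np.random.rand(h * w, 77),
    class_tokens=np.array([2, 5]),   # e.g. positions of "dog", "car" in the prompt
    h=h, w=w,
)
```

The segment count and the choice of clustering algorithm are free parameters here; the key idea is only that self-attentions group pixels while cross-attentions name the groups, with the refinement step operating afterwards on the generated image itself.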