Diffusion models have recently attracted growing research attention for their remarkable transfer abilities in semantic segmentation tasks. However, generating fine-grained segmentation masks with diffusion models often requires additional training on annotated datasets, leaving it unclear to what extent pre-trained diffusion models alone understand the semantic relations of their generated images. To address this question, we leverage the semantic knowledge extracted from Stable Diffusion (SD) to build an image segmentor capable of generating fine-grained segmentation maps without any additional training. The primary difficulty is that semantically meaningful feature maps typically exist only in the spatially lower-dimensional layers, which makes it hard to extract pixel-level semantic relations directly from these feature maps. To overcome this issue, our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps by exploiting SD's generation process, and uses them to construct image-resolution segmentation maps. In extensive experiments, the produced segmentation maps are well delineated and capture detailed parts of the images, indicating the existence of highly accurate pixel-level semantic knowledge in diffusion models.
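The final step described in the abstract, turning a low-resolution semantic grouping into an image-resolution segmentation map through pixel-to-location correspondences, can be illustrated with a minimal sketch. The abstract does not say how the low-resolution features are grouped or how the correspondences are scored, so everything below is an assumption made for illustration: plain k-means stands in for the grouping step, and the `correspondence` array is a hypothetical placeholder for whatever scores are derived from SD's generation process. The function names `kmeans` and `upsample_labels` are likewise hypothetical.

```python
import numpy as np


def kmeans(features, k, iters=20, seed=0):
    """Plain k-means on (n, d) float features; returns an (n,) label array.

    Assumption: k-means is used here only as a stand-in grouping step.
    """
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct feature vectors (fancy indexing copies).
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every feature to every center.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels


def upsample_labels(low_res_labels, correspondence):
    """Build an image-resolution segmentation map.

    low_res_labels: (h*w,) cluster label of each low-resolution location.
    correspondence: (H, W, h*w) hypothetical score of how strongly each
        image pixel corresponds to each low-resolution location.
    Returns an (H, W) array where each pixel takes the label of its
    best-matching low-resolution location.
    """
    best_location = correspondence.argmax(axis=-1)  # (H, W)
    return low_res_labels[best_location]
```

Given a flattened low-resolution feature map `feats` of shape `(h*w, d)`, the sketch would be used as `upsample_labels(kmeans(feats, k), correspondence)`. Taking the single best-matching location per pixel is the simplest possible choice; a weighted vote over all locations would be an equally valid variant.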