Significant strides have been made using large vision-language models, such as Stable Diffusion (SD), for a variety of downstream tasks, including image editing, image correspondence, and 3D shape generation. Inspired by these advancements, we propose SLiMe, which leverages these large vision-language models to segment images at any desired granularity using as few as one annotated sample. SLiMe frames this problem as an optimization task. Specifically, given a single training image and its segmentation mask, we first extract attention maps, including our novel "weighted accumulated self-attention map", from the SD prior. Then, using the extracted attention maps, we optimize the text embeddings of Stable Diffusion so that each one learns a single segmented region from the training image. These learned embeddings highlight the segmented regions in the attention maps, which in turn can be used to derive the segmentation map. This enables SLiMe to segment any real-world image at inference time with the granularity of the segmented regions in the training image, using just one example. Moreover, leveraging additional training data when available (i.e., the few-shot setting) improves the performance of SLiMe. We carried out an extensive set of experiments examining various design factors and showed that SLiMe outperforms other existing one-shot and few-shot segmentation methods.
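To make the optimization framing concrete, the following is a minimal toy sketch and not the SLiMe implementation. It assumes that each pixel is described by a small fixed feature vector (a stand-in for the frozen SD features), that the "attention map" is a per-pixel softmax over dot products between pixel features and one learnable embedding per region, and that the embeddings are fitted with a cross-entropy loss against the mask. The loss, the data, and the names `attention`, `fit_embeddings`, and `segment` are all hypothetical choices made for illustration; the weighted accumulated self-attention map is not modeled here.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def attention(features, embeddings):
    """Toy attention map: for each pixel, a softmax over the region embeddings.

    Assumption: a plain dot product stands in for cross-attention in SD.
    """
    return [
        softmax([sum(f * e for f, e in zip(feat, emb)) for emb in embeddings])
        for feat in features
    ]


def fit_embeddings(features, mask, num_regions, steps=200, lr=0.5, seed=0):
    """Optimize one embedding per region so the attention maps match the mask.

    Only the embeddings are updated; the pixel features stay frozen.
    The loss is mean per-pixel cross-entropy (an assumption of this sketch).
    """
    rng = random.Random(seed)
    dim = len(features[0])
    emb = [[rng.uniform(-0.01, 0.01) for _ in range(dim)] for _ in range(num_regions)]
    for _ in range(steps):
        attn = attention(features, emb)
        grad = [[0.0] * dim for _ in range(num_regions)]
        for feat, probs, label in zip(features, attn, mask):
            for k in range(num_regions):
                # Gradient of cross-entropy w.r.t. the logit of region k.
                err = probs[k] - (1.0 if k == label else 0.0)
                for d in range(dim):
                    grad[k][d] += err * feat[d] / len(features)
        for k in range(num_regions):
            for d in range(dim):
                emb[k][d] -= lr * grad[k][d]
    return emb


def segment(features, embeddings):
    """Derive a segmentation by taking the arg-max region per pixel."""
    return [
        max(range(len(probs)), key=probs.__getitem__)
        for probs in attention(features, embeddings)
    ]


# One annotated "training image": six pixels with made-up 2-D features,
# labeled with three regions (0, 1, 2).
train_features = [(1.0, 0.1), (0.9, -0.1), (0.1, 1.0), (-0.1, 0.9), (-1.0, -0.9), (-0.9, -1.0)]
train_mask = [0, 0, 1, 1, 2, 2]

learned = fit_embeddings(train_features, train_mask, num_regions=3)

# An unseen "test image" is segmented with the same learned embeddings.
test_features = [(0.8, 0.0), (0.0, 1.1), (-1.1, -1.0), (1.1, 0.2)]
predicted = segment(test_features, learned)
print(predicted)
```

The point of the sketch is the division of labor: the feature extractor is never trained, and only one small embedding per region is learned from a single annotated example and then reused on new images.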