Significant strides have been made using large vision-language models, like Stable Diffusion (SD), for a variety of downstream tasks, including image editing, image correspondence, and 3D shape generation. Inspired by these advancements, we propose SLiMe, which leverages such large vision-language models to segment images at any desired granularity using as few as one annotated sample. SLiMe frames this problem as an optimization task. Specifically, given a single training image and its segmentation mask, we first extract attention maps, including our novel "weighted accumulated self-attention map", from the SD prior. Then, using the extracted attention maps, the text embeddings of Stable Diffusion are optimized such that each of them learns about a single segmented region from the training image. These learned embeddings then highlight the segmented region in the attention maps, which can in turn be used to derive the segmentation map. This enables SLiMe to segment any real-world image during inference with the granularity of the segmented region in the training image, using just one example. Moreover, leveraging additional training data when available (i.e., the few-shot setting) improves the performance of SLiMe. We carried out an extensive set of experiments examining various design factors and showed that SLiMe outperforms other existing one-shot and few-shot segmentation methods.
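To make the optimization view concrete, the following is a minimal toy sketch, not the SLiMe implementation. It assumes that hand-written 2-D vectors stand in for the frozen per-pixel features that SD would provide, that a per-pixel softmax over dot products stands in for the attention maps, and that a plain cross-entropy loss with gradient descent stands in for the actual training objective. All function names (`attention`, `optimize_embeddings`, `segment`) and all data values are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(features, embeddings):
    """Toy attention map: per-pixel softmax over regions of <feature, embedding>."""
    return [
        softmax([sum(f * e for f, e in zip(feat, emb)) for emb in embeddings])
        for feat in features
    ]

def optimize_embeddings(features, mask, num_regions, steps=200, lr=0.5):
    """Learn one embedding per region; the pixel features stay frozen (assumption)."""
    dim = len(features[0])
    embeddings = [[0.0] * dim for _ in range(num_regions)]
    n = len(features)
    for _ in range(steps):
        attn = attention(features, embeddings)
        for k in range(num_regions):
            # Gradient of the mean cross-entropy between attention and mask
            # with respect to the embedding of region k.
            grad = [0.0] * dim
            for feat, probs, label in zip(features, attn, mask):
                coeff = probs[k] - (1.0 if label == k else 0.0)
                for d in range(dim):
                    grad[d] += coeff * feat[d] / n
            for d in range(dim):
                embeddings[k][d] -= lr * grad[d]
    return embeddings

def segment(features, embeddings):
    """Assign each pixel to the region whose embedding attends to it most."""
    return [
        max(range(len(embeddings)), key=lambda k: probs[k])
        for probs in attention(features, embeddings)
    ]

# One annotated "image": four pixels, region 0 = background, region 1 = part.
train_features = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]]
train_mask = [0, 0, 1, 1]
learned = optimize_embeddings(train_features, train_mask, num_regions=2)

# An unseen "image" is segmented with the learned embeddings only.
test_features = [[0.8, 0.2], [0.2, 0.9]]
predicted = segment(test_features, learned)
```

The point of the sketch is the division of labor: the feature extractor is never updated, only the per-region embeddings are, and inference reuses those embeddings on new inputs. In the toy version the features of a new image are given directly, whereas in the setting described above they would come from the SD prior.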