Significant strides have been made in using large vision-language models, such as Stable Diffusion (SD), for a variety of downstream tasks, including image editing, image correspondence, and 3D shape generation. Inspired by these advancements, we explore leveraging these models to segment images at any desired granularity using as few as one annotated sample, and propose SLiMe for this purpose. SLiMe frames the problem as an optimization task. Specifically, given a single training image and its segmentation mask, we first extract attention maps, including our novel "weighted accumulated self-attention map", from the SD prior. Then, using the extracted attention maps, we optimize the text embeddings of Stable Diffusion so that each embedding learns a single segmented region of the training image. These learned embeddings then highlight the segmented regions in the attention maps, which can in turn be used to derive the segmentation map. This enables SLiMe to segment any real-world image at inference time with the granularity of the segmented regions in the training image, using just one example. Moreover, leveraging additional training data when available, i.e., the few-shot setting, improves SLiMe's performance. We carried out an extensive set of experiments examining various design factors and showed that SLiMe outperforms other existing one-shot and few-shot segmentation methods.
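To make the optimization view concrete, the following is a minimal toy sketch, not SLiMe's actual implementation. It does not use Stable Diffusion: each "pixel" is a hypothetical 2-D feature vector, the "attention map" is a per-pixel softmax over dot products with one learnable embedding per region, and the embeddings are fitted to a single annotated example by gradient descent. The cross-entropy loss and all names (`attention`, `fit_embeddings`, `segment`) are assumptions made for illustration; the abstract does not specify them.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(features, embeddings):
    # Toy stand-in for cross-attention: for every pixel feature, a softmax
    # over its dot products with each region embedding.
    return [
        softmax([sum(f_d * e_d for f_d, e_d in zip(f, e)) for e in embeddings])
        for f in features
    ]

def fit_embeddings(features, mask, num_regions, steps=200, lr=0.5):
    # Optimize one embedding per region so the attention maps match the mask.
    # Assumed loss: mean cross-entropy between attention and mask labels.
    dim = len(features[0])
    n = len(features)
    embeddings = [[0.0] * dim for _ in range(num_regions)]
    for _ in range(steps):
        probs = attention(features, embeddings)
        grads = [[0.0] * dim for _ in range(num_regions)]
        for f, p, label in zip(features, probs, mask):
            for k in range(num_regions):
                err = p[k] - (1.0 if k == label else 0.0)
                for d in range(dim):
                    grads[k][d] += err * f[d] / n
        for k in range(num_regions):
            for d in range(dim):
                embeddings[k][d] -= lr * grads[k][d]
    return embeddings

def segment(features, embeddings):
    # Derive the segmentation by picking the most-attended region per pixel.
    return [
        max(range(len(p)), key=p.__getitem__)
        for p in attention(features, embeddings)
    ]

# One annotated "training image": four pixels, two regions (hypothetical data).
train_features = [(1.0, 0.1), (0.9, 0.0), (0.1, 1.0), (0.0, 0.9)]
train_mask = [0, 0, 1, 1]

learned = fit_embeddings(train_features, train_mask, num_regions=2)

# An unseen "test image" segmented with the learned embeddings.
test_features = [(0.8, 0.2), (0.2, 0.7)]
predicted = segment(test_features, learned)
print(predicted)
```

In this toy setting the learned embeddings separate the two regions, so the unseen pixels are assigned to the region whose training features they resemble. The real method differs in substance: it extracts attention maps from the SD prior, including the weighted accumulated self-attention map, and optimizes SD's text embeddings.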