The Diffusion Model has not only achieved remarkable results in image generation but has also demonstrated its potential as an effective pretraining method that leverages unlabeled data. Building on the strong capabilities the Diffusion Model has shown in semantic correspondence and open-vocabulary segmentation, our work initiates an investigation into applying the Latent Diffusion Model to Few-shot Semantic Segmentation. Recently, inspired by the in-context learning ability of large language models, Few-shot Semantic Segmentation has evolved into In-context Segmentation, which has become a key benchmark for assessing generalist segmentation models. In this context, we focus on Few-shot Semantic Segmentation, laying a solid foundation for the future development of a Diffusion-based generalist segmentation model. We first study how to enable interaction between the query image and the support image, and propose a KV fusion method within the self-attention mechanism. We then examine how to best inject information from the support mask, and re-evaluate how to provide reasonable supervision from the query mask. Based on this analysis, we establish a simple and effective framework named DiffewS, which maximally retains the generative framework of the original Latent Diffusion Model and effectively exploits its pre-trained prior. Experimental results show that our method significantly outperforms previous SOTA models in multiple settings.
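To make the KV fusion idea concrete, the sketch below shows one plausible reading of it: inside a self-attention layer, the keys and values computed from the support image are concatenated with those of the query image, so the query tokens attend over both streams. This is a minimal, single-head illustration; the class name, shapes, and simplifications are assumptions for exposition, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class KVFusionSelfAttention(nn.Module):
    """Hypothetical sketch of KV fusion inside a self-attention layer."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, query_tokens: torch.Tensor, support_tokens: torch.Tensor) -> torch.Tensor:
        # query_tokens:   (B, N_q, dim) latent tokens of the query image
        # support_tokens: (B, N_s, dim) latent tokens of the support image
        q = self.to_q(query_tokens)

        # KV fusion: keys/values come from the concatenation of query and
        # support tokens, so the query image attends to the support image
        # within the same (originally self-) attention layer.
        fused = torch.cat([query_tokens, support_tokens], dim=1)
        k = self.to_k(fused)
        v = self.to_v(fused)

        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.to_out(attn @ v)


if __name__ == "__main__":
    # Toy usage: 16x16 latents flattened to 256 tokens of width 64.
    layer = KVFusionSelfAttention(dim=64)
    q_img = torch.randn(1, 256, 64)
    s_img = torch.randn(1, 256, 64)
    print(layer(q_img, s_img).shape)  # torch.Size([1, 256, 64])
```

One appeal of this design, as suggested by the abstract, is that it reuses the pretrained self-attention weights of the Latent Diffusion Model rather than introducing new cross-attention modules.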