Explore In-Context Segmentation via Latent Diffusion Models

In-context segmentation has drawn increasing attention since the introduction of vision foundation models. Most existing approaches adopt metric learning or masked image modeling to build the correlation between visual prompts and input image queries. In this work, we explore the problem from a new perspective, using a representative generative model, the latent diffusion model (LDM). We observe a task gap between generation and segmentation in diffusion models, but the LDM is still an effective minimalist solution for in-context segmentation. In particular, we propose two meta-architectures and design several corresponding output alignment and optimization strategies. Through comprehensive ablation studies, we find empirically that segmentation quality depends on output alignment and in-context instructions. Moreover, we build a new and fair in-context segmentation benchmark that includes both image and video datasets. Experiments validate the efficiency of our approach, demonstrating results comparable to or even stronger than those of previous specialist models or vision foundation models. Our study shows that LDMs can also achieve good enough results on challenging in-context segmentation tasks.

