In-context segmentation has drawn increasing attention since the introduction of vision foundation models. Most existing approaches adopt metric learning or masked image modeling to build the correlation between visual prompts and input image queries. In this work, we explore the problem from a new perspective, using a representative generative model, the latent diffusion model (LDM). We observe a task gap between generation and segmentation in diffusion models, yet find that an LDM can still serve as an effective minimalist framework for in-context segmentation. In particular, we propose two meta-architectures and design several corresponding output alignment and optimization strategies. Through comprehensive ablation studies, we empirically find that segmentation quality depends on output alignment and in-context instructions. Moreover, we build a new and fair in-context segmentation benchmark that includes both image and video datasets. Experiments validate the effectiveness of our approach, demonstrating results comparable to or even stronger than those of previous specialist models and vision foundation models. Our study shows that LDMs can also achieve sufficiently good results on challenging in-context segmentation tasks.