Recent work has explored generalist segmentation models that tackle a variety of image segmentation tasks within a unified in-context learning framework. However, these methods still struggle with task ambiguity in in-context segmentation, since not every in-context example accurately conveys the task information. To address this issue, we present SINE, a simple image Segmentation framework utilizing in-context examples. Our approach leverages a Transformer encoder-decoder structure, where the encoder provides high-quality image representations and the decoder yields multiple task-specific output masks to eliminate task ambiguity. Specifically, we introduce an In-context Interaction module, which complements the in-context information and produces correlations between the target image and the in-context example, and a Matching Transformer, which uses fixed matching together with the Hungarian algorithm to eliminate differences between tasks. In addition, we refine the current evaluation protocol for in-context image segmentation to enable a more holistic appraisal of these models. Experiments on a range of segmentation tasks demonstrate the effectiveness of the proposed method.
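To make the matching step concrete, the sketch below shows how Hungarian matching is typically used in set-prediction decoders to assign predicted masks to ground-truth masks; this is a minimal illustration using a Dice-based cost, not SINE's exact cost function, and all names here are hypothetical.

```python
# Hypothetical sketch of Hungarian matching between predicted and
# ground-truth masks, as used in set-prediction segmentation decoders.
# The Dice cost and helper names are illustrative assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def dice_cost(pred, gt):
    # pred: (N, H*W), gt: (M, H*W), mask values in [0, 1]
    inter = pred @ gt.T                            # (N, M) pairwise overlaps
    sums = pred.sum(1)[:, None] + gt.sum(1)[None, :]
    dice = (2 * inter + 1) / (sums + 1)            # smoothed Dice similarity
    return 1 - dice                                # lower cost = better match

def hungarian_match(pred_masks, gt_masks):
    cost = dice_cost(pred_masks, gt_masks)         # (N_pred, N_gt) cost matrix
    rows, cols = linear_sum_assignment(cost)       # optimal one-to-one assignment
    return [(int(r), int(c)) for r, c in zip(rows, cols)]

# Toy example: 3 predicted masks, 2 ground-truth masks
rng = np.random.default_rng(0)
preds = (rng.random((3, 16)) > 0.5).astype(float)
gts = preds[[2, 0]]                                # gt 0 == pred 2, gt 1 == pred 0
print(hungarian_match(preds, gts))                 # → [(0, 1), (2, 0)]
```

The assignment is one-to-one over the smaller set, so surplus predictions are left unmatched and can be supervised as "no object", which is what lets a single decoder head serve multiple task definitions.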