Segment Everything Everywhere All at Once

In this work, we present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image, as shown in Fig. 1. In SEEM, we propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks, aiming at a universal segmentation interface that behaves like large language models (LLMs). More specifically, SEEM is designed with four desiderata: i) Versatility. We introduce a new visual prompt to unify different spatial queries, including points, boxes, scribbles, and masks, which can further generalize to a different referring image; ii) Compositionality. We learn a joint visual-semantic space between text and visual prompts, which facilitates the dynamic composition of the two prompt types as required by various segmentation tasks; iii) Interactivity. We further incorporate learnable memory prompts into the decoder to retain segmentation history through mask-guided cross-attention from the decoder to image features; and iv) Semantic-awareness. We use a text encoder to encode text queries and mask labels into the same semantic space for open-vocabulary segmentation. We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks. Notably, our single SEEM model achieves competitive performance across interactive segmentation, generic segmentation, referring segmentation, and video object segmentation on 9 datasets with as little as 1/100 of the supervision. Furthermore, SEEM shows a remarkable capacity to generalize to novel prompts and their combinations, making it a readily usable universal image segmentation interface.
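
To make the phrase "mask-guided cross-attention" more concrete, the following is a minimal sketch of the general technique, not SEEM's actual decoder. It assumes a single attention head, no learned projections, and plain Python lists; the function name `masked_cross_attention` and the fallback for an empty mask are illustrative assumptions. Each query (for example, a memory prompt) may only attend to the image positions that a mask marks as allowed, such as the region predicted in an earlier interaction round.

```python
import math


def masked_cross_attention(queries, features, mask):
    """Single-head cross-attention restricted by a boolean mask.

    queries:  list of Q query vectors, each of length d
    features: list of N image feature vectors, each of length d
    mask:     Q x N list of bools, True where a query may attend
    Returns (outputs, weights): Q output vectors and Q x N attention weights.
    """
    d = len(features[0])
    outputs, weights = [], []
    for q, allowed in zip(queries, mask):
        # Assumption: if a mask row is empty, fall back to attending everywhere.
        if not any(allowed):
            allowed = [True] * len(features)
        # Scaled dot-product scores; masked positions get -inf.
        scores = [
            sum(a * b for a, b in zip(q, f)) / math.sqrt(d) if ok else -math.inf
            for f, ok in zip(features, allowed)
        ]
        # Numerically stable softmax; exp(-inf) evaluates to 0.0.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        # Output is the attention-weighted sum of the image features.
        out = [sum(wi * f[k] for wi, f in zip(w, features)) for k in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy input: four image positions, one query, and a mask
# that only allows the first two positions.
features = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [-3.0, 2.0]]
queries = [[1.0, 1.0]]
mask = [[True, True, False, False]]
outputs, weights = masked_cross_attention(queries, features, mask)
```

In this toy input the masked positions receive zero attention weight, so the output depends only on the first two feature vectors. In an interactive setting, the mask for the next round would come from the previous round's prediction, which is one way segmentation history can be carried forward.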
