We propose a novel AutoRegressive Generation-based paradigm for image Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level perception within a unified framework. Prior works integrating image segmentation into multimodal large language models (MLLMs) typically employ either boundary-point representations or dedicated segmentation heads. These methods rely on discrete representations or semantic prompts fed into task-specific decoders, which limits the ability of the MLLM to capture fine-grained visual details. To address these challenges, we introduce a segmentation framework for MLLMs based on image generation, which naturally produces dense masks for target objects. We leverage the MLLM to output visual tokens and detokenize them into images using a universal VQ-VAE, so that segmentation depends entirely on the pixel-level understanding of the MLLM. To reduce inference latency, we employ a next-scale-prediction strategy to generate the required visual tokens in parallel. Extensive experiments demonstrate that our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed, while maintaining strong understanding capabilities.
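To make the pipeline concrete, the following is a minimal sketch (not the authors' released code) of how per-scale visual tokens emitted by an MLLM could be detokenized into a dense binary mask with a VQ-VAE-style decoder. All module names, shapes, and the scale schedule are illustrative assumptions; the abstract does not specify these implementation details.

```python
# Hedged sketch: multi-scale visual tokens -> VQ-VAE-style detokenizer -> dense mask.
# Every name and hyperparameter below is a placeholder assumption, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyVQDecoder(nn.Module):
    """Minimal stand-in for a universal VQ-VAE detokenizer (codebook + conv decoder)."""

    def __init__(self, vocab_size: int = 8192, dim: int = 64, out_channels: int = 1):
        super().__init__()
        self.codebook = nn.Embedding(vocab_size, dim)   # token id -> latent vector
        self.decoder = nn.Sequential(                   # latent grid -> pixel grid
            nn.ConvTranspose2d(dim, dim, 4, stride=2, padding=1),
            nn.GELU(),
            nn.ConvTranspose2d(dim, out_channels, 4, stride=2, padding=1),
        )

    def tokens_to_latents(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, H, W) integer grid -> (B, dim, H, W) latent feature map
        return self.codebook(token_ids).permute(0, 3, 1, 2)

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        return self.decoder(latents)


def detokenize_multiscale(vq: ToyVQDecoder, per_scale_tokens: list[torch.Tensor]) -> torch.Tensor:
    """Fuse token grids predicted scale-by-scale (coarse -> fine) and decode a mask.

    Each element of `per_scale_tokens` is a (B, h_k, w_k) grid produced in one
    parallel next-scale-prediction step; coarser grids are upsampled to the finest
    resolution and summed, a common residual multi-scale scheme (assumed here).
    """
    h, w = per_scale_tokens[-1].shape[-2:]
    fused = torch.zeros(per_scale_tokens[0].shape[0], vq.codebook.embedding_dim, h, w)
    for tokens in per_scale_tokens:
        latents = vq.tokens_to_latents(tokens)
        fused = fused + F.interpolate(latents, size=(h, w), mode="bilinear", align_corners=False)
    logits = vq.decode(fused)                 # (B, 1, 4h, 4w) mask logits
    return (logits.sigmoid() > 0.5).float()   # dense binary mask for the referred object


if __name__ == "__main__":
    vq = ToyVQDecoder()
    # Dummy token grids for scales 4x4, 8x8, 16x16, as an MLLM might emit them in parallel.
    scales = [torch.randint(0, 8192, (1, s, s)) for s in (4, 8, 16)]
    mask = detokenize_multiscale(vq, scales)
    print(mask.shape)  # torch.Size([1, 1, 64, 64])
```

Because each scale is predicted in a single step, the number of autoregressive iterations grows with the number of scales rather than the number of tokens, which is the source of the claimed inference-speed advantage.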