LISA: Reasoning Segmentation via Large Language Model

Although perception systems have made remarkable advancements in recent years, they still rely on explicit human instruction to identify the target objects or categories before executing visual recognition tasks. Such systems lack the ability to actively reason and comprehend implicit user intentions. In this work, we propose a new segmentation task -- reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. Furthermore, we establish a benchmark comprising over one thousand image-instruction pairs, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of the multi-modal Large Language Model (LLM) while also possessing the ability to produce segmentation masks. We expand the original vocabulary with a <SEG> token and propose the embedding-as-mask paradigm to unlock the segmentation capability. Remarkably, LISA can handle cases involving: 1) complex reasoning; 2) world knowledge; 3) explanatory answers; 4) multi-turn conversation. Also, it demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation image-instruction pairs results in further performance enhancement. Experiments show our method not only unlocks new reasoning segmentation capabilities but also proves effective in both complex reasoning segmentation and standard referring segmentation tasks. Code, models, and demo are at https://github.com/dvlab-research/LISA.
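The embedding-as-mask idea mentioned above can be illustrated with a toy sketch. The abstract does not specify the decoder or the projection layers, so everything below is an assumption for illustration only: the helper names (`extend_vocab`, `seg_embedding`, `decode_mask`) are hypothetical, and the dot-product decoder is a stand-in for whatever mask decoder the actual model uses. The sketch shows only the general flow: add a `<SEG>` token to the vocabulary, read the hidden state at that token's position, and turn that embedding into a binary mask.

```python
import math

SEG_TOKEN = "<SEG>"

def extend_vocab(vocab, token=SEG_TOKEN):
    """Return a copy of vocab with `token` added, plus the token's id."""
    vocab = dict(vocab)
    if token not in vocab:
        vocab[token] = len(vocab)
    return vocab, vocab[token]

def seg_embedding(token_ids, hidden_states, seg_id):
    """Return the hidden state at the last <SEG> position, or None if absent."""
    for pos in range(len(token_ids) - 1, -1, -1):
        if token_ids[pos] == seg_id:
            return hidden_states[pos]
    return None

def decode_mask(embedding, pixel_features, threshold=0.5):
    """Toy decoder (assumption): sigmoid of the dot product between the
    <SEG> embedding and each per-pixel feature vector, then threshold."""
    mask = []
    for row in pixel_features:
        mask_row = []
        for feat in row:
            logit = sum(e * f for e, f in zip(embedding, feat))
            prob = 1.0 / (1.0 + math.exp(-logit))
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask

# Made-up toy inputs: a 3-token output ending in <SEG>, 2-dim hidden states,
# and a 2x2 grid of 2-dim pixel features.
vocab, seg_id = extend_vocab({"<pad>": 0, "it": 1, "is": 2})
token_ids = [1, 2, seg_id]
hidden_states = [[0.0, 0.0], [0.1, 0.2], [1.0, -1.0]]
pixel_features = [[[2.0, 0.0], [0.0, 2.0]],
                  [[1.0, 1.0], [3.0, 1.0]]]

mask = decode_mask(seg_embedding(token_ids, hidden_states, seg_id), pixel_features)
print(mask)
```

In this sketch the language model's text output and the mask share a single interface: the mask is produced only when the generated sequence contains `<SEG>`, so ordinary text answers are unaffected.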
