LISA: Reasoning Segmentation via Large Language Model

Although perception systems have made remarkable advancements in recent years, they still rely on explicit human instruction to identify the target objects or categories before executing visual recognition tasks. Such systems lack the ability to actively reason about and comprehend implicit user intentions. In this work, we propose a new segmentation task -- reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. Furthermore, we establish a benchmark comprising over one thousand image-instruction pairs that incorporate intricate reasoning and world knowledge for evaluation. Finally, we present LISA: Large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of the multi-modal Large Language Model (LLM) while also possessing the ability to produce segmentation masks. We expand the original vocabulary with a <SEG> token and propose the embedding-as-mask paradigm to unlock the segmentation capability. Remarkably, LISA can handle cases involving 1) complex reasoning; 2) world knowledge; 3) explanatory answers; 4) multi-turn conversation. It also demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation image-instruction pairs further improves performance. Experiments show that our method not only unlocks new reasoning segmentation capabilities but also proves effective in both complex reasoning segmentation and standard referring segmentation tasks. Code, models, and a demo are available at https://github.com/dvlab-research/LISA.

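The embedding-as-mask idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the decoding details, so everything below is an assumption made for illustration: the token id `SEG_TOKEN_ID`, the function name `embedding_as_mask`, the linear projection, and the dot product with per-pixel features (a simple stand-in for whatever mask decoder the model actually uses). The sketch only shows the general pattern: locate the <SEG> token in the generated sequence, take its hidden embedding, and turn that embedding into a binary mask.

```python
import numpy as np

# Hypothetical id assigned to the <SEG> token added to the vocabulary.
SEG_TOKEN_ID = 32000


def embedding_as_mask(token_ids, hidden_states, pixel_features,
                      proj_w, proj_b, threshold=0.0):
    """Turn the hidden embedding of the <SEG> token into a binary mask.

    token_ids:      (T,)      generated token ids
    hidden_states:  (T, D)    last-layer hidden state for each token
    pixel_features: (H, W, C) per-pixel features from a vision backbone
    proj_w, proj_b: (D, C), (C,) projection into the pixel feature space
    Returns an (H, W) boolean mask, or None if no <SEG> token was generated.
    """
    positions = np.flatnonzero(token_ids == SEG_TOKEN_ID)
    if positions.size == 0:
        return None  # text-only answer: nothing to segment

    # Use the embedding at the last <SEG> position as the mask query.
    seg_embedding = hidden_states[positions[-1]]   # (D,)
    query = seg_embedding @ proj_w + proj_b        # (C,)

    # Assumed simplification: score each pixel by a dot product with the
    # query, where a real system would run a learned mask decoder.
    logits = pixel_features @ query                # (H, W)
    return logits > threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ids = np.array([1, 5, 9, SEG_TOKEN_ID, 2])
    hidden = rng.normal(size=(5, 8))
    pixels = rng.normal(size=(4, 6, 3))
    w, b = rng.normal(size=(8, 3)), np.zeros(3)
    mask = embedding_as_mask(ids, hidden, pixels, w, b)
    print(mask.shape, mask.dtype)
```

Because the mask is derived from a token the language model chooses to emit, the same sketch covers text-only replies: when no <SEG> token appears, the function returns `None` and no mask is produced.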