We present SegLLM, a novel multi-round interactive reasoning segmentation model that enhances LLM-based segmentation by exploiting conversational memory of both visual and textual outputs. By leveraging a mask-aware multimodal LLM, SegLLM re-integrates previous segmentation results into its input stream, enabling it to reason about complex user intentions and segment objects in relation to previously identified entities across multiple interactions, capturing positional, interactional, and hierarchical relationships. This capability allows SegLLM to respond to visual and text queries in a chat-like manner. Evaluated on the newly curated MRSeg benchmark, SegLLM outperforms existing methods in multi-round interactive reasoning segmentation by over 20%. Additionally, we observe that training on multi-round reasoning segmentation data improves performance on standard single-round referring segmentation and localization tasks, yielding a 5.5% gain in cIoU for referring expression segmentation and a 4.5% improvement in Acc@0.5 for referring expression localization.
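The abstract does not detail SegLLM's mask-aware architecture, so the following is only a minimal illustrative sketch of the idea it describes: re-encoding a previous round's segmentation mask and appending it to the LLM's input embedding stream so that later rounds can refer back to it. All class names, dimensions, and the pooling-plus-projection encoder below are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: SegLLM's actual mask re-integration mechanism is not
# specified in this abstract. This shows one plausible way a "mask-aware" model
# could encode a previous segmentation mask as an extra token in the LLM input.
import torch
import torch.nn as nn


class MaskEncoder(nn.Module):
    """Encodes a binary segmentation mask into a single embedding token (hypothetical)."""

    def __init__(self, hidden_dim: int = 4096, mask_size: int = 64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d((mask_size, mask_size))
        self.proj = nn.Linear(mask_size * mask_size, hidden_dim)

    def forward(self, mask: torch.Tensor) -> torch.Tensor:
        # mask: (B, 1, H, W) binary mask produced in an earlier round
        pooled = self.pool(mask).flatten(1)      # (B, mask_size**2)
        return self.proj(pooled).unsqueeze(1)    # (B, 1, hidden_dim)


def append_mask_to_stream(text_embeds: torch.Tensor,
                          prev_mask: torch.Tensor,
                          mask_encoder: MaskEncoder) -> torch.Tensor:
    """Concatenate the encoded previous-round mask onto the token embedding
    stream, so the next round can reason about the previously segmented entity."""
    mask_token = mask_encoder(prev_mask)         # (B, 1, D)
    return torch.cat([text_embeds, mask_token], dim=1)


if __name__ == "__main__":
    enc = MaskEncoder(hidden_dim=4096)
    text_embeds = torch.randn(1, 32, 4096)           # current-round token embeddings
    prev_mask = torch.rand(1, 1, 224, 224).round()   # mask from the previous round
    fused = append_mask_to_stream(text_embeds, prev_mask, enc)
    print(fused.shape)                               # torch.Size([1, 33, 4096])
```

In a full system, such a mask token would presumably be interleaved with the conversation history at the position of the round that produced it, rather than simply appended at the end as in this toy example.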