We present SegLLM, a novel multi-round interactive reasoning segmentation model that enhances LLM-based segmentation by exploiting conversational memory of both visual and textual outputs. By leveraging a mask-aware multimodal LLM, SegLLM re-integrates previous segmentation results into its input stream, enabling it to reason about complex user intentions and to segment objects in relation to previously identified entities, including positional, interactional, and hierarchical relationships, across multiple interactions. This capability allows SegLLM to respond to visual and text queries in a chat-like manner. Evaluated on the newly curated MRSeg benchmark, SegLLM outperforms existing methods in multi-round interactive reasoning segmentation by over 20%. Additionally, we observe that training on multi-round reasoning segmentation data enhances performance on standard single-round referring segmentation and localization tasks, yielding a 5.5% increase in cIoU for referring expression segmentation and a 4.5% improvement in Acc@0.5 for referring expression localization.