While LISA effectively bridges the gap between segmentation and large language models to enable reasoning segmentation, it has two limitations: it cannot distinguish different instances of the target region, and it is constrained by pre-defined textual response formats. In this work, we introduce LISA++, an update to the existing LISA model that improves core functionalities while keeping the base architecture intact. The main enhancements in LISA++ are: \textbf{1) Enhanced Segmentation}: Instance segmentation ability is added, providing more detailed scene analysis alongside the existing multi-region semantic segmentation. \textbf{2) More Natural Conversation}: The capability for multi-turn dialogue is improved, with the ability to incorporate segmentation results directly into text responses, i.e., Segmentation in Dialogue (SiD). These improvements are achieved by curating existing samples from generic segmentation datasets, specifically to enhance segmentation and conversational skills, without structural changes or additional data sources. Comparative analysis with the original LISA model shows significant improvements in these areas, positioning LISA++ as a notable upgrade in visual understanding and interaction. The adaptability and improved features of LISA++ highlight the versatility of the mask-as-embedding paradigm proposed by LISA, as well as its potential as a foundational model for diverse applications.