In this paper, we introduce InstructSAM, a unified and streamlined framework for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a large-scale, high-quality dataset and benchmark for instruction-based instance segmentation that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only the 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.
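To make the hybrid-attention idea concrete, the following is a minimal sketch of one plausible attention mask, not the authors' implementation. The abstract does not specify the masking pattern, so everything here is an assumption: the token layout (visual tokens, then instruction tokens, then instance queries), the rule that prefix tokens stay causal while instance queries attend to the full prefix and to each other bidirectionally, and the hypothetical function name `build_hybrid_mask`.

```python
from typing import List


def build_hybrid_mask(num_visual: int, num_text: int, num_queries: int) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout (hypothetical): [visual tokens | instruction tokens | instance queries].
    """
    prefix = num_visual + num_text
    total = prefix + num_queries
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < prefix:
                # Assumption: visual and instruction tokens keep the usual
                # causal pattern, so they never see the instance queries.
                mask[i][j] = j <= i
            else:
                # Assumption: each instance query sees the whole prefix and
                # every other query, so queries can coordinate with each other.
                mask[i][j] = True
    return mask


# Toy example: 2 visual tokens, 2 instruction tokens, 3 instance queries.
mask = build_hybrid_mask(num_visual=2, num_text=2, num_queries=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions, bidirectional attention among the queries is what would let one slot observe what the others have claimed, which is one way such a mechanism could reduce duplicate predictions. Leaving the prefix causal keeps the VLM's behavior on ordinary tokens unchanged.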