Advances in CLIP and large multimodal models (LMMs) have enabled open-vocabulary and free-text segmentation, yet existing models still require predefined category prompts, precluding free-form, self-generated category prediction. Most segmentation LMMs also remain confined to sparse predictions, restricting their applicability in open-set environments. In contrast, we propose ROSE, a Revolutionary Open-set dense SEgmentation LMM, which enables dense mask prediction and open-category generation through patch-wise perception. Our method treats each image patch as an independent region-of-interest candidate, enabling the model to predict dense and sparse masks simultaneously. Additionally, a newly designed instruction-response paradigm takes full advantage of the generation and generalization capabilities of LMMs, achieving category prediction free of closed-set constraints and predefined categories. To further enhance mask detail and category precision, we introduce a conversation-based refinement paradigm that integrates the prediction from the previous step with a textual prompt for revision. Extensive experiments demonstrate that ROSE achieves competitive performance across various segmentation tasks in a unified framework. Code will be released.
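To make the patch-wise perception idea concrete, below is a minimal sketch of how each patch token could act as an independent region-of-interest query that is decoded into its own mask over a dense pixel grid. All names here (`PatchWiseMaskHead`, `query_proj`, `pixel_proj`) and the dot-product mask decoding are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class PatchWiseMaskHead(nn.Module):
    """Hypothetical sketch: every image patch embedding acts as an
    independent region-of-interest query, decoded into a per-patch
    mask-logit map over the dense pixel feature grid."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Assumed projections; the abstract does not specify these layers.
        self.query_proj = nn.Linear(dim, dim)  # patch token -> mask query
        self.pixel_proj = nn.Linear(dim, dim)  # pixel feature -> mask key

    def forward(self, patch_tokens: torch.Tensor, pixel_feats: torch.Tensor):
        # patch_tokens: (B, P, D), one token per image patch (P candidates)
        # pixel_feats:  (B, N, D), dense per-pixel features (N = H * W)
        queries = self.query_proj(patch_tokens)  # (B, P, D)
        keys = self.pixel_proj(pixel_feats)      # (B, N, D)
        # Each patch query yields its own mask-logit map, so dense masks
        # (all P patches) and sparse masks (a selected subset of patches)
        # fall out of the same computation.
        mask_logits = torch.einsum("bpd,bnd->bpn", queries, keys)
        return mask_logits  # apply sigmoid + threshold for binary masks

# Usage: one image, 196 patches (14 x 14), a 64 x 64 pixel grid.
head = PatchWiseMaskHead(dim=256)
masks = head(torch.randn(1, 196, 256), torch.randn(1, 64 * 64, 256))
print(masks.shape)  # torch.Size([1, 196, 4096])
```

Under this reading, "dense" prediction simply means keeping all patch queries, while "sparse" prediction keeps only the queries selected by the instruction; the mask head itself is unchanged.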