IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation

Recent research on medical multimodal large language models (MLLMs) has gradually shifted its focus from image-level understanding to fine-grained, pixel-level comprehension. Although segmentation serves as the foundation for pixel-level understanding, existing approaches face two major challenges. First, they introduce implicit segmentation tokens and require simultaneous fine-tuning of both the MLLM and external pixel decoders, which increases the risk of catastrophic forgetting and limits generalization to out-of-domain scenarios. Second, most methods rely on single-pass reasoning and cannot iteratively refine segmentation results, leading to suboptimal performance. To overcome these limitations, we propose a novel agentic MLLM, named IBISAgent, that reformulates segmentation as a vision-centric, multi-step decision-making process. IBISAgent enables MLLMs to generate interleaved reasoning and text-based click actions, invoke segmentation tools, and produce high-quality masks without architectural modifications. By iteratively performing multi-step visual reasoning on masked image features, IBISAgent naturally supports mask refinement and promotes the development of pixel-level visual reasoning capabilities. We further design a two-stage training framework consisting of cold-start supervised fine-tuning and agentic reinforcement learning with tailored, fine-grained rewards, enhancing the model's robustness in complex medical referring and reasoning segmentation tasks. Extensive experiments demonstrate that IBISAgent consistently outperforms both closed-source and open-source state-of-the-art (SOTA) methods. All datasets, code, and trained models will be released publicly.
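
The multi-step formulation described above can be illustrated with a minimal, self-contained sketch. It is not the IBISAgent implementation: the scripted policy stands in for the MLLM (it peeks at a reference mask to imitate visual judgment), the flood-fill function stands in for an external promptable segmentation tool, and the `click(row, col, positive|negative)` action syntax and all function names are assumptions made for illustration only.

```python
import re
from collections import deque

# Hypothetical text action format; the paper's actual syntax is not given here.
ACTION_RE = re.compile(r"click\((\d+),\s*(\d+),\s*(positive|negative)\)")

def flood_fill(image, seed):
    """Return the set of 4-connected pixels sharing the seed's intensity."""
    rows, cols = len(image), len(image[0])
    value = image[seed[0]][seed[1]]
    region, queue = {seed}, deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region and image[nr][nc] == value:
                region.add((nr, nc))
                queue.append((nr, nc))
    return region

def toy_segment_tool(image, clicks):
    """Toy stand-in for a promptable segmenter: replay clicks into a mask.

    Positive clicks add the clicked connected region, negative clicks remove it.
    """
    mask = set()
    for r, c, label in clicks:
        region = flood_fill(image, (r, c))
        mask = mask | region if label == "positive" else mask - region
    return mask

def scripted_policy(target, mask):
    """Stand-in for the MLLM: returns (reasoning text, text action)."""
    missing = sorted(target - mask)
    extra = sorted(mask - target)
    if missing:
        r, c = missing[0]
        return f"Pixel ({r}, {c}) is not covered yet.", f"click({r}, {c}, positive)"
    if extra:
        r, c = extra[0]
        return f"Mask leaks at ({r}, {c}).", f"click({r}, {c}, negative)"
    return "Mask matches the referred object.", "stop"

def run_episode(image, target, max_steps=8):
    """Interleave reasoning, text click actions, and tool calls until 'stop'."""
    clicks, mask, trace = [], set(), []
    for _ in range(max_steps):
        thought, action = scripted_policy(target, mask)
        trace.append((thought, action))
        match = ACTION_RE.fullmatch(action)
        if match is None:  # "stop" or a malformed action ends the episode
            break
        clicks.append((int(match.group(1)), int(match.group(2)), match.group(3)))
        mask = toy_segment_tool(image, clicks)  # refined mask feeds the next step
    return mask, trace

def iou(a, b):
    """Intersection over union of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy image with two disjoint structures (intensities 1 and 2) to segment.
IMAGE = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 0],
]
TARGET = {(r, c) for r, row in enumerate(IMAGE) for c, v in enumerate(row) if v > 0}

if __name__ == "__main__":
    final_mask, steps = run_episode(IMAGE, TARGET)
    for thought, action in steps:
        print(thought, "->", action)
    print("IoU:", iou(final_mask, TARGET))
```

In this toy run the first click covers only one of the two structures, and the second step adds the other before the policy stops. That is the kind of mask refinement a single-pass method cannot perform. Because the actions are plain text parsed outside the model, no segmentation tokens or architectural changes are needed in the sketch.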