Language-based segmentation has been a popular topic in computer vision. While recent advances in multimodal large language models (MLLMs) have endowed segmentation systems with reasoning capabilities, these efforts remain confined by the frozen internal knowledge of MLLMs, which limits their applicability to real-world scenarios involving up-to-date information or domain-specific concepts. In this work, we propose \textbf{Seg-ReSearch}, a novel segmentation paradigm that overcomes the knowledge bottleneck of existing approaches. By enabling interleaved reasoning and external search, Seg-ReSearch empowers segmentation systems to handle dynamic, open-world queries that extend beyond the frozen knowledge of MLLMs. To effectively train this capability, we introduce a hierarchical reward design that harmonizes initial guidance with progressive incentives, mitigating the dilemma between sparse outcome signals and rigid step-wise supervision. For evaluation, we construct OK-VOS, a challenging benchmark that explicitly requires outside knowledge for video object segmentation. Experiments on OK-VOS and two existing reasoning segmentation benchmarks demonstrate that Seg-ReSearch outperforms state-of-the-art approaches by a substantial margin. Code and data will be released at https://github.com/iSEE-Laboratory/Seg-ReSearch.