Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation

Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Current methods mainly rely on large-scale supervised fine-tuning (SFT) of Multi-modal Large Language Models (MLLMs). However, this paradigm suffers from heavy data dependence and limited scalability in the face of rapidly evolving MLLMs. Although recent zero-shot approaches offer a flexible alternative, their performance remains significantly behind that of SFT-based methods because of their simplistic workflow designs. To address these limitations, we propose \textbf{Refer-Agent}, a collaborative multi-agent system with alternating reasoning-reflection mechanisms. The system decomposes RVOS into a step-by-step reasoning process. During reasoning, we introduce a Coarse-to-Fine frame selection strategy to ensure frame diversity and textual relevance, along with a Dynamic Focus Layout that adaptively adjusts the agent's visual focus. Furthermore, we propose a Chain-of-Reflection mechanism, which employs a Questioner-Responder pair to generate a self-reflection chain, enabling the system to verify intermediate results and generate feedback for refining the next round of reasoning. Extensive experiments on five challenging benchmarks demonstrate that Refer-Agent significantly outperforms state-of-the-art methods, including both SFT-based models and zero-shot approaches. Moreover, Refer-Agent is flexible and enables fast integration of new MLLMs without any additional fine-tuning cost. Code will be released at https://github.com/iSEE-Laboratory/Refer-Agent.
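
The alternating reasoning-reflection idea can be illustrated with a minimal sketch. This is not the released Refer-Agent code: the names `reflect`, `reason_with_reflection`, `Reflection`, and `max_rounds`, as well as the callable signatures, are hypothetical and chosen only for illustration. The sketch assumes a reasoner that maps a query plus feedback to an intermediate result, a Questioner that proposes verification questions, and a Responder that answers each question with a pass/fail verdict and a comment.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Reflection:
    """Outcome of one reflection pass (hypothetical structure)."""
    accepted: bool  # True if every verification question passed
    feedback: str   # notes for the next reasoning round; empty when accepted


def reflect(result: str,
            questioner: Callable[[str], List[str]],
            responder: Callable[[str, str], Tuple[bool, str]]) -> Reflection:
    """Build a chain of verification questions and answers for one result."""
    notes = []
    for question in questioner(result):
        ok, comment = responder(result, question)
        if not ok:
            # Keep only the failed checks; they become feedback.
            notes.append(f"{question} -> {comment}")
    return Reflection(accepted=not notes, feedback="; ".join(notes))


def reason_with_reflection(query: str,
                           reasoner: Callable[[str, str], str],
                           questioner: Callable[[str], List[str]],
                           responder: Callable[[str, str], Tuple[bool, str]],
                           max_rounds: int = 3) -> str:
    """Alternate reasoning and reflection until accepted or out of rounds."""
    feedback = ""
    result = ""
    for _ in range(max_rounds):
        result = reasoner(query, feedback)          # reasoning step
        reflection = reflect(result, questioner, responder)  # reflection step
        if reflection.accepted:
            break
        feedback = reflection.feedback              # refine the next round
    return result
```

In a real system each callable would presumably wrap an MLLM call, which is why swapping in a new model would not require fine-tuning under this design; the round limit is an assumption added here to guarantee termination.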
