Cross-Video Reasoning (CVR) has emerged as a critical frontier in multimodal intelligence, requiring models to retrieve, align, and aggregate evidence distributed across multiple videos. Current Multimodal Large Language Models (MLLMs) often struggle with CVR because simple single-pass strategies encode multiple videos into a shared compressed context, which can obscure rare but critical evidence. In this paper, we propose AgentCVR, a multi-agent framework that treats CVR as an active evidence-acquisition task. AgentCVR employs a Master Agent that iteratively coordinates specialized Visual and Audio Agents for targeted evidence extraction. To make training efficient, we introduce Script-Simulated RL, which optimizes the agent's policy using LLM-generated semantic scripts and a lightweight text-based simulator, bypassing costly multimodal inference during online exploration. Experimental results on a comprehensive CVR benchmark show that AgentCVR outperforms single-pass baselines and achieves performance comparable to state-of-the-art closed-source systems, particularly in complex cross-video alignment and localization. To support reproducibility, we release our code at https://github.com/wang-jh24/AgentCVR.
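To make the idea of a text-based simulator concrete, the following is a minimal sketch and not the AgentCVR implementation. It assumes that each video stream has been summarized in advance as a plain-text script, and that a query to a Visual or Audio Agent can be approximated by a keyword lookup in that script. The names `ScriptSimulator` and `master_loop`, the keyword-matching rule, and the fixed probing order are all hypothetical; the paper's Master Agent uses a learned policy, which this toy loop does not model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ScriptSimulator:
    """Hypothetical text-only stand-in for the Visual and Audio Agents.

    `scripts` maps (video_id, modality) to a text description of that stream.
    """
    scripts: Dict[Tuple[str, str], str]

    def query(self, video_id: str, modality: str, keyword: str) -> Optional[str]:
        # Return the script text if it mentions the keyword, otherwise None.
        text = self.scripts.get((video_id, modality), "")
        return text if keyword.lower() in text.lower() else None


def master_loop(
    sim: ScriptSimulator,
    video_ids: List[str],
    keyword: str,
    max_steps: int = 8,
) -> List[Tuple[str, str, str]]:
    """Toy master policy (assumption: fixed order, not learned).

    Probes each (video, modality) pair in turn, up to `max_steps` queries,
    and collects every script that mentions the keyword.
    """
    evidence = []
    actions = [(v, m) for v in video_ids for m in ("visual", "audio")]
    for video_id, modality in actions[:max_steps]:
        hit = sim.query(video_id, modality, keyword)
        if hit is not None:
            evidence.append((video_id, modality, hit))
    return evidence
```

Because every query in this sketch is a string lookup, a rollout costs no multimodal inference, which is the property the abstract attributes to the simulator. An RL setup would replace the fixed probing order with a policy that chooses the next video, modality, and query.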