LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

Existing multimodal large language models (MLLMs) struggle to model temporal context in long videos. Mainstream agent-based methods use external tools to assist a single MLLM in answering long-video questions. Even with such tool support, a single MLLM captures only a partial understanding of a long video, which limits performance. To better address long-video tasks, we introduce LVAgent, the first framework that enables multi-round dynamic collaboration of MLLM agents for long video understanding. Our method consists of four key steps: 1) Selection: we pre-select appropriate agents from a model library to form the optimal agent team for each task. 2) Perception: we design an effective retrieval scheme for long videos that improves coverage of critical temporal segments while maintaining computational efficiency. 3) Action: agents answer long-video questions and exchange their reasoning. 4) Reflection: we evaluate each agent's performance in each round of discussion and optimize the agent team for dynamic collaboration. Through this multi-round collaboration, the agents iteratively refine their answers. LVAgent is the first agent-based method to outperform all closed-source models (such as GPT-4o) and open-source models (such as InternVL-2.5 and Qwen2-VL) on long video understanding tasks. LVAgent achieves an accuracy of 80% on four mainstream long video understanding tasks. Notably, it improves accuracy by 13.3% on LongVideoBench. Code is available at https://github.com/64327069/LVAgent.
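
The four steps can be read as a simple control loop. The sketch below is a toy illustration only, not the released LVAgent code: the names `Agent`, `constant_agent`, `select_team`, `retrieve_segments`, and `collaborate` are hypothetical, the agents are plain Python callables rather than MLLMs, retrieval is naive keyword overlap over text captions, and the reflection rule (keep only agents that agree with the majority answer) is an assumed stand-in for the paper's per-round evaluation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical agent interface:
# (question, retrieved segments, peers' reasons) -> (answer, reason)
AgentFn = Callable[[str, Sequence[str], Sequence[str]], Tuple[str, str]]


@dataclass
class Agent:
    name: str
    respond: AgentFn
    prior_score: float  # assumed per-task score used for pre-selection


def constant_agent(name: str, answer: str, prior_score: float) -> Agent:
    """Build a toy agent that always returns the same answer."""
    def respond(question, context, reasons):
        return answer, f"{name} reads the segments as supporting {answer}"
    return Agent(name, respond, prior_score)


def select_team(library: Sequence[Agent], team_size: int) -> List[Agent]:
    """Selection: keep the agents with the highest prior score."""
    ranked = sorted(library, key=lambda a: a.prior_score, reverse=True)
    return ranked[:team_size]


def retrieve_segments(segments: Sequence[str], question: str, top_k: int) -> List[str]:
    """Perception: rank segment captions by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        segments,
        key=lambda s: len(q_words & set(s.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def collaborate(library, segments, question, team_size=3, top_k=2, max_rounds=3):
    """Run selection and perception once, then loop over action and reflection."""
    team = select_team(library, team_size)
    context = retrieve_segments(segments, question, top_k)
    reasons: List[str] = []
    majority = ""
    for _ in range(max_rounds):
        # Action: every agent answers and sees the reasons from the last round.
        outputs = [(agent, *agent.respond(question, context, reasons)) for agent in team]
        majority, votes = Counter(ans for _, ans, _ in outputs).most_common(1)[0]
        if votes == len(team):
            break  # the team agrees, so stop early
        # Reflection (assumed rule): share reasons, drop agents outside the majority.
        reasons = [reason for _, _, reason in outputs]
        team = [agent for agent, ans, _ in outputs if ans == majority]
    return majority
```

With three toy agents of which two answer "B" and one answers "A", the first round yields no consensus, the dissenting agent is dropped, and the second round returns "B". A real system would replace the callables with MLLM inference and the overlap score with a learned retrieval model.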
