Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still rely heavily on greedy parsing of densely sampled video frames, which incurs high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability. VideoSeek operates in a think-act-observe loop with a toolkit designed for collecting multi-granular video observations. This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning. Experiments on four challenging video understanding and reasoning benchmarks show that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs. Notably, VideoSeek improves on its base model, GPT-5, by 10.2 absolute points on LVBench while using 93% fewer frames. Further analysis highlights the importance of leveraging video logic flow and strong reasoning capability, as well as the complementary roles of the toolkit's components.
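To make the think-act-observe loop concrete, the following is a minimal sketch under stated assumptions, not VideoSeek's actual implementation. The tool names (`overview`, `zoom`), the caption strings standing in for frames, the frame budgets, and the rule-based `think` policy are all hypothetical; in the real system the reasoning step would be performed by an LMM, and the toolkit is not specified here. The sketch only illustrates how a coarse look can yield a cue that directs a finer look, so that evidence is found without parsing every frame.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Frame = Tuple[int, str]  # (timestamp in seconds, toy caption standing in for pixels)


@dataclass
class Action:
    name: str                      # "overview", "zoom", or "answer" (hypothetical tool names)
    start: int = 0
    end: int = 0
    anchor: int = -1               # timestamp of the cue that triggered a zoom
    answer: Optional[int] = None   # timestamp of the evidence, if found


def sample(video: List[str], start: int, end: int, n: int) -> List[Frame]:
    """Tool: uniformly sample up to n frames from video[start:end]."""
    start, end = max(0, start), min(len(video), end)
    n = min(n, end - start)
    if n <= 0:
        return []
    step = (end - start) / n
    return [(start + int(i * step), video[start + int(i * step)]) for i in range(n)]


def think(observations: List[Frame], zoomed: List[int], video_len: int) -> Action:
    """Toy rule-based policy; an assumption standing in for the reasoning model."""
    for t, caption in observations:
        if "red key" in caption:                 # answer-critical evidence observed
            return Action("answer", answer=t)
    if not observations:                         # nothing seen yet: take a coarse look
        return Action("overview", 0, video_len)
    for t, caption in observations:
        if "searching" in caption and t not in zoomed:   # follow an unexplored cue
            return Action("zoom", t - 30, t + 30, anchor=t)
    return Action("answer", answer=None)         # no leads left: give up


def run_agent(video: List[str], overview_frames: int = 20,
              zoom_frames: int = 60, max_steps: int = 8) -> Tuple[Optional[int], int]:
    """Run the loop; return (evidence timestamp or None, number of frames inspected)."""
    observations: List[Frame] = []
    zoomed: List[int] = []
    frames_used = 0
    for _ in range(max_steps):
        action = think(observations, zoomed, len(video))          # think
        if action.name == "answer":
            return action.answer, frames_used
        budget = overview_frames if action.name == "overview" else zoom_frames
        new_frames = sample(video, action.start, action.end, budget)  # act
        if action.name == "zoom":
            zoomed.append(action.anchor)
        observations.extend(new_frames)                           # observe
        frames_used += len(new_frames)
    return None, frames_used


def make_toy_video() -> List[str]:
    """A 600-second toy video with one caption per second."""
    video = ["kitchen"] * 300 + ["garage"] * 300
    for t in range(420, 450):
        video[t] = "person searching shelf"
    video[437] = "person holds red key"
    return video


answer, used = run_agent(make_toy_video())
print(answer, used)  # prints: 437 80
```

In this toy setting the agent inspects 80 of 600 frames: 20 for the coarse overview, then 60 around the cue found at 420 s. The numbers are properties of the toy example only and say nothing about VideoSeek's measured frame savings.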