Long-form video understanding remains a fundamental challenge for current Video Large Language Models. Most existing models rely on static reasoning over uniformly sampled frames, which weakens temporal localization and leads to substantial information loss in long videos. Agentic tools such as temporal retrieval, spatial zoom, and temporal zoom offer a natural way to overcome these limitations by enabling adaptive exploration of key moments. However, constructing agentic video understanding data requires models that already possess strong long-form video comprehension, creating a circular dependency. We address this challenge with VideoThinker, an agentic Video Large Language Model trained entirely on synthetic tool-interaction trajectories. Our key idea is to convert videos into rich captions and employ a powerful agentic language model to generate multi-step tool-use sequences in caption space. These trajectories are then grounded back to video by replacing captions with the corresponding frames, yielding a large-scale dataset of interleaved video and tool reasoning without requiring any long-form understanding from the underlying model. Training on this synthetic agentic dataset equips VideoThinker with dynamic reasoning, adaptive temporal exploration, and multi-step tool use. Remarkably, VideoThinker significantly outperforms both caption-only language-model agents and strong video-model baselines across long-video benchmarks, demonstrating the effectiveness of tool-augmented synthetic data and adaptive retrieval-and-zoom reasoning for long-form video understanding.
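The grounding step described above, replacing a caption-space observation with the corresponding video frames, can be sketched as follows. This is a minimal illustrative sketch: all names here (`Step`, `ground_step`, the `fps` parameter) are assumptions for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str          # e.g. "temporal_retrieval", "spatial_zoom", "temporal_zoom"
    span: tuple        # (start_sec, end_sec) window the tool call attends to
    observation: str   # the caption text the language agent originally saw

def ground_step(step: Step, fps: float = 1.0) -> dict:
    """Swap the caption-space observation for the frame indices covering
    the same temporal span, producing a video-grounded training step.
    (Hypothetical helper; the paper's pipeline may differ in detail.)"""
    start, end = step.span
    frame_ids = list(range(int(start * fps), int(end * fps) + 1))
    return {"tool": step.tool, "span": step.span, "frames": frame_ids}

# A retrieval step over seconds 40-43 maps to frames [40..43] at 1 fps.
step = Step("temporal_retrieval", (40, 43), "a person opens the door")
grounded = ground_step(step)
assert grounded["frames"] == [40, 41, 42, 43]
```

Applied over every step of a synthetic trajectory, this substitution yields the interleaved video-and-tool-reasoning training data without the caption-generating model ever needing long-form video understanding itself.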