Recent advances in image reasoning methods, particularly "Thinking with Images", have demonstrated remarkable success in Multimodal Large Language Models (MLLMs); however, this dynamic reasoning paradigm has not yet been extended to video reasoning tasks. In this paper, we propose Video-Thinker, which empowers MLLMs to think with videos by autonomously leveraging their intrinsic "grounding" and "captioning" capabilities to generate reasoning clues throughout the inference process. To spark this capability, we construct Video-Thinker-10K, a curated dataset featuring autonomous tool usage within chain-of-thought reasoning sequences. Our training strategy begins with Supervised Fine-Tuning (SFT) to learn the reasoning format, followed by Group Relative Policy Optimization (GRPO) to strengthen this reasoning capability. Through this approach, Video-Thinker enables MLLMs to autonomously navigate grounding and captioning tasks for video reasoning, eliminating the need for constructing and calling external tools. Extensive experiments demonstrate that Video-Thinker achieves significant performance gains on both in-domain tasks and challenging out-of-domain video reasoning benchmarks, including Video-Holmes, CG-Bench-Reasoning, and VRBench. Our Video-Thinker-7B substantially outperforms existing baselines such as Video-R1 and establishes state-of-the-art performance among 7B-sized MLLMs.