Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning

Recent advances in image reasoning methods, particularly "Thinking with Images", have demonstrated remarkable success in Multimodal Large Language Models (MLLMs); however, this dynamic reasoning paradigm has not yet been extended to video reasoning tasks. In this paper, we propose Video-Thinker, which empowers MLLMs to think with videos by autonomously leveraging their intrinsic "grounding" and "captioning" capabilities to generate reasoning clues throughout the inference process. To spark this capability, we construct Video-Thinker-10K, a curated dataset featuring autonomous tool usage within chain-of-thought reasoning sequences. Our training strategy begins with Supervised Fine-Tuning (SFT) to learn the reasoning format, followed by Group Relative Policy Optimization (GRPO) to strengthen this reasoning capability. Through this approach, Video-Thinker enables MLLMs to autonomously navigate grounding and captioning tasks for video reasoning, eliminating the need for constructing and calling external tools. Extensive experiments demonstrate that Video-Thinker achieves significant performance gains on both in-domain tasks and challenging out-of-domain video reasoning benchmarks, including Video-Holmes, CG-Bench-Reasoning, and VRBench. Our Video-Thinker-7B substantially outperforms existing baselines such as Video-R1 and establishes state-of-the-art performance among 7B-sized MLLMs.
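
The abstract names GRPO as the second training stage but gives no details of the reward or the optimization settings. As a rough illustration only, the following sketch shows the group-relative advantage at the core of GRPO as it is commonly described: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration; they are not taken from Video-Thinker.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other rewards in the same group.

    `rewards` holds one scalar reward per response sampled for a single
    prompt. `eps` (an assumed constant) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses to one video question
# (for example, 1.0 for a correct final answer and 0.0 otherwise).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above their group's mean receive a positive advantage and are reinforced; those below the mean are discouraged. Because the baseline comes from the group itself, no separate value network is needed.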

