Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
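The inference-time gate described above can be sketched as a small decision function. This is a minimal illustration, not the paper's implementation: the threshold value and the `reason_fn` callback are assumptions, and in practice the confidence score would be derived from the model's token probabilities for the initial answer.

```python
def answer_when_necessary(initial_answer, confidence, reason_fn, threshold=0.8):
    """Reason-when-necessary gate (sketch).

    If the first-pass confidence clears the threshold, return the direct
    answer and skip reasoning entirely; otherwise invoke the (more
    expensive) reasoning pass and return its reviewed answer.

    `reason_fn` is a hypothetical stand-in for the model's chain-of-thought
    pass, and `threshold` is an assumed hyperparameter, not a value from
    the paper. Returns (answer, used_reasoning).
    """
    if confidence >= threshold:
        # Cheap path: the initial answer is confident enough to emit directly.
        return initial_answer, False
    # Expensive path: run explicit reasoning and emit the reviewed answer.
    return reason_fn(initial_answer), True
```

A perception-style query with high first-pass confidence would take the short path, while a reasoning-heavy query with low confidence would trigger the reviewed answer, matching the activation pattern the abstract reports.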