Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically rely on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM's reasoning capability. Our framework enables models to operate on compressed visual tokens and to generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at https://github.com/LaVi-Lab/Rethink_CoT_Video.