Video understanding requires not only recognizing visual content but also performing temporally grounded, multi-step reasoning over long and noisy observations. We propose Process-of-Thought (PoT) Reasoning for Videos, a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence. The framework is designed to be model-agnostic and can be plugged into existing vision-language backbones, supporting both closed-book reasoning and evidence-augmented reasoning with external tools. We further introduce a unified representation for PoT traces that aligns intermediate decisions with temporal segments, which improves robustness to distractors and reduces hallucinated explanations. Extensive experiments on standard video reasoning tasks demonstrate that PoT consistently improves factual correctness and temporal grounding, while providing interpretable reasoning traces for diagnosis and downstream use.
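To make the loop described above concrete, the following is a minimal, hypothetical sketch of a PoT trace and its interleaved steps. All names (`PoTStep`, `PoTTrace`, `run_pot`) and the three callables standing in for backbone-specific components (`select_evidence`, `update_state`, `synthesize`) are illustrative assumptions, not the authors' implementation; the sketch only shows how intermediate decisions can stay aligned with temporal segments.

```python
from dataclasses import dataclass, field

@dataclass
class PoTStep:
    # Illustrative trace step: each entry keeps the reasoning
    # traceable to a specific temporal segment of the video.
    segment: tuple   # (start_sec, end_sec) of the selected evidence
    state: str       # step-wise hypothesis after observing the segment
    rationale: str   # why this segment was selected

@dataclass
class PoTTrace:
    # Unified representation aligning intermediate decisions
    # with temporal segments, as described in the abstract.
    question: str
    steps: list = field(default_factory=list)
    answer: str = ""

def run_pot(question, select_evidence, update_state, synthesize, max_steps=4):
    """Interleave (i) temporal evidence selection, (ii) step-wise state
    updates, and (iii) constrained answer synthesis. The three callables
    are placeholders for model-specific components (an assumption of
    this sketch, keeping the loop backbone-agnostic)."""
    trace = PoTTrace(question=question)
    state = ""
    for _ in range(max_steps):
        segment, rationale = select_evidence(question, state)
        if segment is None:  # no further useful evidence: stop early
            break
        state = update_state(state, segment)
        trace.steps.append(PoTStep(segment, state, rationale))
    trace.answer = synthesize(question, state)
    return trace
```

Because every step records both the selected segment and the hypothesis it produced, the resulting trace can be inspected directly, which is how such a structure would support the diagnosis and downstream use mentioned above.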