Large Multimodal Models (LMMs) have demonstrated impressive capabilities in video reasoning via Chain-of-Thought (CoT). However, the robustness of their reasoning chains remains questionable. In this paper, we identify a critical failure mode, termed textual inertia: once a textual hallucination occurs in the thinking process, models tend to adhere blindly to the erroneous text while neglecting conflicting visual evidence. To investigate this systematically, we propose the LogicGraph Perturbation Protocol, which structurally injects perturbations into the reasoning chains of diverse LMMs, spanning both native reasoning architectures and prompt-driven paradigms, to evaluate their capacity for self-reflection. The results reveal that models successfully self-correct in fewer than 10% of cases and predominantly succumb to blind propagation of textual errors. To mitigate this, we introduce Active Visual-Context Refinement, a training-free inference paradigm that couples an active visual re-grounding mechanism, which enforces fine-grained verification, with an adaptive context refinement strategy that summarizes and denoises the reasoning history. Experiments demonstrate that our approach substantially suppresses hallucination propagation and enhances reasoning robustness.
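The abstract only names the two components of Active Visual-Context Refinement; the minimal sketch below illustrates, under stated assumptions, what such a training-free inference loop could look like. The `lmm` and `sample_frames` callables, the prompts, and all parameter names are hypothetical placeholders introduced for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of an Active Visual-Context Refinement loop, assuming a generic
# lmm(prompt, frames) -> text callable and a frame sampler. Names are illustrative only.
from typing import Callable, List, Sequence


def avcr_reasoning_loop(
    question: str,
    sample_frames: Callable[[str], Sequence],   # claim -> relevant video frames (assumed helper)
    lmm: Callable[[str, Sequence], str],        # (prompt, frames) -> text (assumed interface)
    max_steps: int = 8,
    refine_every: int = 3,
) -> str:
    """Alternate reasoning steps with (a) visual re-grounding of each new claim
    and (b) periodic summarization and denoising of the reasoning history."""
    history: List[str] = []
    for step in range(max_steps):
        context = "\n".join(history)
        # 1) Produce the next reasoning step conditioned on the refined context.
        claim = lmm(
            f"Question: {question}\nReasoning so far:\n{context}\nNext step:",
            sample_frames(question),
        )
        # 2) Active visual re-grounding: verify the claim against freshly sampled
        #    frames instead of trusting the generated text alone.
        verdict = lmm(
            "Does the visual evidence support this claim? Answer yes/no and correct it if not.\n"
            f"Claim: {claim}",
            sample_frames(claim),
        )
        history.append(claim if verdict.lower().startswith("yes") else verdict)
        # 3) Adaptive context refinement: periodically compress the accumulated
        #    history so earlier hallucinations are not propagated verbatim.
        if (step + 1) % refine_every == 0:
            summary = lmm(
                "Summarize the verified reasoning so far, dropping unsupported statements:\n"
                + "\n".join(history),
                (),
            )
            history = [summary]
        if "final answer" in history[-1].lower():
            break
    return history[-1] if history else ""
```

The design choice mirrored here is that verification is triggered actively at every step against fresh visual evidence, and the textual context is periodically rewritten rather than appended to, so an early hallucination cannot dominate later steps.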