Recent advances in Large Language Models have successfully shifted toward System 2 reasoning, yet applying these paradigms to video understanding remains challenging. While prevailing research attributes failures in Video-LLMs to perceptual limitations, our empirical analysis reveals a cognitive misalignment we term Semantic Inertia, in which models suppress valid visual evidence in favor of dominant language priors. To rectify this, we propose VISTA, a training-free framework designed to align perception with logical deduction. By dynamically routing inference paths and materializing implicit visual features into explicit textual anchors, our approach counterbalances the influence of parametric knowledge. Furthermore, we incorporate a Latent Reasoning Consensus mechanism to mitigate stochastic hallucinations. VISTA achieves strong results across a wide range of benchmarks, outperforming its base model by 9.3% on EgoSchema and 5.6% on VideoEspresso, and rivaling or even surpassing larger and proprietary models. Our codebase will be made publicly available.
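To make the consensus idea concrete, the sketch below shows a generic majority vote over independently sampled reasoning traces, in the spirit of self-consistency decoding. It is only an illustration under that assumption, not the paper's Latent Reasoning Consensus mechanism; the function names `generate_reasoning` and `extract_answer` are hypothetical placeholders for model-specific sampling and answer parsing.

```python
from collections import Counter
from typing import Callable, List


def consensus_answer(
    generate_reasoning: Callable[[str], str],  # hypothetical: samples one stochastic reasoning trace
    extract_answer: Callable[[str], str],      # hypothetical: parses the final answer from a trace
    prompt: str,
    num_samples: int = 5,
) -> str:
    """Illustrative majority-vote consensus over sampled reasoning traces.

    A generic self-consistency-style sketch; the paper's actual mechanism
    may aggregate latent reasoning differently.
    """
    answers: List[str] = []
    for _ in range(num_samples):
        trace = generate_reasoning(prompt)     # stochastic sampling (e.g., temperature > 0)
        answers.append(extract_answer(trace))
    # Keep the answer most traces agree on, damping one-off hallucinated outputs.
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer
```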