Video Large Language Models (Video-LLMs) remain prone to spatiotemporal hallucinations, often generating visually unsupported details or incorrect temporal relations. Existing mitigation methods typically treat hallucination as a uniform decoding failure, applying globally shared correction rules. We instead observe that decoder layers contribute differently to visual grounding and later linguistic composition, indicating that intervention must be layer-aware. Based on this insight, we propose STEAR, a layer-aware spatiotemporal evidence intervention framework. STEAR identifies high-risk decoding steps and selects token-conditioned visual evidence from grounding-sensitive middle layers. It uses this shared evidence for two coupled purposes: restoring missing local grounding in middle layers, and constructing temporally perturbed patch-level counterfactuals to falsify inconsistent reasoning during late-layer decoding. Consequently, STEAR mitigates both spatial and temporal hallucinations within an efficient single-encode inference framework. Experiments across representative Video-LLM backbones and challenging benchmarks demonstrate that STEAR consistently reduces hallucinations while improving faithfulness, temporal consistency, and robustness. Our results confirm that reliable video decoding relies on intervening on precise evidence at the right layer, rather than enforcing a global penalty. The code is provided in the Supplementary Material.
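To make the late-layer counterfactual step more concrete, the following is a minimal sketch of risk-gated contrastive decoding. It is not the STEAR implementation: the abstract does not specify how high-risk steps are detected or how the counterfactual is combined with the original prediction. The entropy-based gate, the `(1 + alpha) * original - alpha * counterfactual` contrast, and the names `is_high_risk`, `contrast_logits`, `decode_step`, `threshold`, and `alpha` are all assumptions chosen for illustration, following the general form used in contrastive decoding. The counterfactual logits are assumed to come from the same model run on temporally perturbed visual evidence.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_high_risk(logits, threshold):
    """Hypothetical risk gate: flag a step whose next-token entropy is high."""
    return entropy(softmax(logits)) > threshold


def contrast_logits(original, counterfactual, alpha):
    """Assumed contrast rule: boost tokens whose score drops under perturbation."""
    return [(1 + alpha) * o - alpha * c for o, c in zip(original, counterfactual)]


def decode_step(original, counterfactual, threshold=0.5, alpha=1.0):
    """Greedy token choice; the contrast is applied only at high-risk steps."""
    if is_high_risk(original, threshold):
        scores = contrast_logits(original, counterfactual, alpha)
    else:
        scores = original
    return max(range(len(scores)), key=scores.__getitem__)


# Confident step: the gate stays closed and the original argmax is kept.
confident = decode_step([5.0, 0.0, 0.0], [0.0, 5.0, 0.0])

# Uncertain step: token 1 loses support when the frames are perturbed,
# so the contrast favors it over token 0, whose score is unchanged.
uncertain = decode_step([2.0, 1.9, 0.0], [2.0, 0.5, 0.0])
```

In this toy setup, a token whose score is unaffected by temporal perturbation is treated as driven by a language prior rather than by the video, which is one way to read "falsify inconsistent reasoning". Gating on risk leaves low-entropy steps untouched, so the extra contrast is only paid where the model is uncertain.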