Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon has recently drawn researchers' attention. Specifically, these models fail to answer rephrased questions in a way that is logically consistent with their own grounding outputs. However, the underlying causes of this phenomenon remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervene on the potential factors behind the phenomenon. We find that one of the primary reasons for the inconsistent responses is the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called Temporally Conditioned Attention Sharpening (TCAS), which constructs an enhancement objective based on attention distinctions to improve the model's temporal resolution and, in turn, the logical consistency of its temporal understanding. Experimental results demonstrate that our method significantly improves the temporal logic consistency of Video-LLMs. Further analyses show that our method indeed improves the temporal discriminability of attention heads, validating our conclusions. Our method also yields performance gains on general video temporal grounding tasks, suggesting that temporal logic consistency is an important factor in temporal understanding.
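To make the idea of an objective "based on attention distinctions" concrete, the following is a minimal illustrative sketch and not the TCAS formulation itself, which the abstract does not specify. It assumes a single cross-modal attention head whose weights over video frame tokens are given as a list, and a hypothetical boolean mask `in_span` marking the frames that belong to the queried temporal segment. The function names `span_attention_mass` and `sharpening_loss`, and the negative-log form of the loss, are assumptions chosen for illustration.

```python
import math


def span_attention_mass(attn, in_span):
    """Fraction of a head's video-token attention that falls inside the span."""
    if len(attn) != len(in_span):
        raise ValueError("attn and in_span must have the same length")
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    inside = sum(a for a, flag in zip(attn, in_span) if flag)
    return inside / total


def sharpening_loss(attn, in_span, eps=1e-8):
    """Hypothetical objective: negative log of the in-span attention mass.

    The loss shrinks as the head concentrates attention on frames inside
    the queried segment. This is an assumed stand-in, not the paper's loss.
    """
    return -math.log(span_attention_mass(attn, in_span) + eps)


# Toy example: 6 frame tokens; the queried event covers frames 2 and 3.
in_span = [False, False, True, True, False, False]
flat_head = [1 / 6] * 6                          # treats all timestamps alike
sharp_head = [0.05, 0.05, 0.4, 0.4, 0.05, 0.05]  # separates the event frames

flat_loss = sharpening_loss(flat_head, in_span)
sharp_loss = sharpening_loss(sharp_head, in_span)
```

In this toy setting the uniform head, which cannot tell timestamps apart, receives a higher loss than the head that concentrates on the event frames, which is the kind of pressure toward temporal discriminability that the abstract describes.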