Large language models (LLMs) have revolutionized video-based computer vision applications, including action recognition, anomaly detection, and video summarization. Videos inherently pose unique challenges, combining spatial complexity with temporal dynamics that are absent in static images or textual data. Current approaches to video understanding with LLMs often rely on pretrained video encoders to extract spatiotemporal features and text encoders to capture semantic meaning. These representations are integrated within LLM frameworks, enabling multimodal reasoning across diverse video tasks. However, a critical question persists: can LLMs truly understand the concept of time, and how effectively can they reason about temporal relationships in videos? This work critically examines the role of LLMs in video processing, with a specific focus on their temporal reasoning capabilities. We identify key limitations in the interaction between LLMs and pretrained encoders, revealing gaps in their ability to model long-term dependencies and abstract temporal concepts such as causality and event progression. Furthermore, we analyze challenges posed by existing video datasets, including biases, lack of temporal annotations, and domain-specific limitations that constrain the temporal understanding of LLMs. To address these gaps, we explore promising future directions, including the co-evolution of LLMs and encoders, the development of enriched datasets with explicit temporal labels, and innovative architectures for integrating spatial, temporal, and semantic reasoning. By addressing these challenges, we aim to advance the temporal comprehension of LLMs, unlocking their full potential in video analysis and beyond.