Grounding events in videos is a fundamental capability in video analysis. While vision-language models (VLMs) are increasingly employed for this task, existing approaches predominantly train models to associate events with timestamps only in forward-played video. This paradigm hinders VLMs from capturing the inherent temporal structure and directionality of events, thereby limiting robustness and generalization. To address this limitation, and inspired by the arrow of time in physics, which characterizes the intrinsic directionality of temporal processes, we propose ArrowGEV, a reinforcement learning framework that explicitly models the temporal directionality of events to improve both event grounding and temporal directionality understanding in VLMs. Specifically, we categorize events as time-sensitive (e.g., putting down a bag) or time-insensitive (e.g., holding a towel in the left hand). The former are events whose meaning changes substantially when the video is reversed, while the latter remain semantically unchanged under reversal. For time-sensitive events, ArrowGEV introduces a reward that encourages VLMs to discriminate between forward and backward videos, whereas for time-insensitive events, it enforces consistent grounding across both directions. Extensive experiments demonstrate that ArrowGEV not only improves grounding precision and temporal directionality recognition, but also enhances general video understanding and reasoning ability.
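The abstract does not give the exact reward formulation, so the following is only a minimal sketch of how a direction-aware grounding reward of this kind might be computed. Everything in it is an assumption made for illustration: the function names (`temporal_iou`, `mirror`, `direction_reward`), the convention that a backward prediction of `None` means "the event does not occur in the reversed video", the use of temporal IoU as the grounding score, and the equal weighting of the two terms.

```python
from typing import Optional, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def mirror(interval: Interval, duration: float) -> Interval:
    """Map an interval in the forward video onto the same frames in the reversed video."""
    start, end = interval
    return (duration - end, duration - start)


def direction_reward(
    pred_fwd: Interval,
    pred_bwd: Optional[Interval],
    gt_fwd: Interval,
    duration: float,
    time_sensitive: bool,
) -> float:
    """Hypothetical reward combining grounding quality with a direction term.

    Assumption: pred_bwd is None when the model says the event is absent
    from the reversed video.
    """
    grounding = temporal_iou(pred_fwd, gt_fwd)
    if time_sensitive:
        # Reversal changes the event's meaning, so the model should NOT
        # ground the original query in the backward video.
        direction = 1.0 if pred_bwd is None else 0.0
    else:
        # Reversal leaves the event unchanged, so the backward prediction
        # should match the mirrored forward prediction.
        direction = (
            0.0 if pred_bwd is None
            else temporal_iou(pred_bwd, mirror(pred_fwd, duration))
        )
    # Equal weighting is an arbitrary choice for this sketch.
    return 0.5 * (grounding + direction)
```

For a 10-second clip, a time-insensitive event grounded at (2, 5) in the forward video corresponds to (5, 8) in the reversed video, so a model that predicts both intervals exactly would receive the full reward under this sketch. A time-sensitive event would receive the full reward only if the model grounds it correctly in the forward video and declines to ground it in the reversed one.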