ArrowGEV: Grounding Events in Video via Learning the Arrow of Time

Grounding events in videos serves as a fundamental capability in video analysis. While Vision-Language Models (VLMs) are increasingly employed for this task, existing approaches predominantly train models to associate events with timestamps in the forward video only. This paradigm hinders VLMs from capturing the inherent temporal structure and directionality of events, thereby limiting robustness and generalization. To address this limitation, inspired by the arrow of time in physics, which characterizes the intrinsic directionality of temporal processes, we propose ArrowGEV, a reinforcement learning framework that explicitly models temporal directionality in events to improve both event grounding and temporal directionality understanding in VLMs. Specifically, we categorize events into time-sensitive (e.g., putting down a bag) and time-insensitive (e.g., holding a towel in the left hand). The former denote events whose reversal substantially alters their meaning, while the latter remain semantically unchanged under reversal. For time-sensitive events, ArrowGEV introduces a reward that encourages VLMs to discriminate between forward and backward videos, whereas for time-insensitive events, it enforces consistent grounding across both directions. Extensive experiments demonstrate that ArrowGEV not only improves grounding precision and temporal directionality recognition, but also enhances general video understanding and reasoning ability.
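The two direction-aware reward terms described above can be illustrated with a minimal sketch. The abstract does not give the exact reward formulation, so everything below is an assumption made for illustration: the function names (`temporal_iou`, `mirror`, `direction_reward`), the use of temporal IoU as the consistency measure, and the convention that the model signals "event absent in the reversed video" by returning no interval are all hypothetical and are not taken from the paper.

```python
from typing import Optional, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def mirror(interval: Interval, duration: float) -> Interval:
    """Map an interval predicted on the reversed video onto the forward timeline."""
    start, end = interval
    return (duration - end, duration - start)


def direction_reward(
    time_sensitive: bool,
    forward_pred: Interval,
    backward_pred: Optional[Interval],  # None: model reports the event as absent
    duration: float,
) -> float:
    """Hypothetical direction-aware reward; not the paper's actual formula."""
    if time_sensitive:
        # Assumed discrimination term: reversing a time-sensitive event
        # (e.g. "putting down a bag") changes its meaning, so the model is
        # rewarded for NOT grounding the original query in the reversed video.
        return 1.0 if backward_pred is None else 0.0
    # Assumed consistency term: a time-insensitive event (e.g. "holding a
    # towel") should occupy the same span once the reversed-video prediction
    # is mirrored back onto the forward timeline.
    if backward_pred is None:
        return 0.0
    return temporal_iou(forward_pred, mirror(backward_pred, duration))
```

For a 10-second clip, a time-insensitive event grounded at (2.0, 6.0) in the forward video should be grounded at (4.0, 8.0) in the reversed one, and the sketch scores that pair as fully consistent. In an actual reinforcement learning setup this term would presumably be combined with a standard grounding-accuracy reward on the forward video; how the two are weighted is not stated in the abstract.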
