We introduce TimeZero, a reasoning-guided large vision-language model (LVLM) designed for temporal video grounding (TVG). Given a natural-language query, the task requires precisely localizing the matching segment within a long video. TimeZero addresses this challenge by extending the inference process, so that the model learns to reason about video-language relationships solely through reinforcement learning. We evaluate TimeZero on two benchmarks, and it achieves state-of-the-art performance on Charades-STA. Code is available at https://github.com/www-Ye/TimeZero.