Video temporal grounding (VTG) aims to locate precise segments in videos based on language queries, a fundamental challenge in video understanding. While recent Multimodal Large Language Models (MLLMs) have shown promise in tackling VTG through reinforcement learning (RL), they overlook the challenges posed by both the quality and the difficulty of training samples. (1) Partially annotated samples. Many samples contain relevant segments beyond the annotated interval, introducing ambiguous supervision. (2) Hard-to-ground samples. Samples with poor zero-shot performance produce consistently low and indistinguishable rewards during RL training, offering no clear preference among the sampled outputs and thus hindering learning efficiency. To address these challenges, we propose VideoTG-R1, a novel curriculum RL framework with reflected boundary annotations that enables data-efficient training. Specifically, we propose a Boundary Reflection Agent that uses MLLMs to predict query-relevant timestamps outside the annotated intervals, allowing us to identify and filter out partially annotated samples and thereby reduce ambiguity. Furthermore, we introduce a Difficulty Estimation Agent to assess the training difficulty of each sample, and design a curriculum RL strategy that dynamically masks the videos of hard-to-ground samples according to the training step, easing the training difficulty and providing a clearer preference signal. Experiments on the VTG and grounded VideoQA tasks demonstrate the effectiveness of our method. Remarkably, with only 10% of the training samples and 21% of the computational budget, VideoTG-R1 outperforms full-data counterparts under both group relative policy optimization (GRPO) and supervised fine-tuning (SFT). The code is available at https://github.com/ldong1111/VideoTG-R1.
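To make the curriculum masking idea concrete, below is a minimal sketch of one plausible schedule: a hard-to-ground sample initially keeps only its annotated interval visible, and the visible window expands to the full video as training progresses. Everything here is illustrative, not the paper's implementation: the function names, the linear schedule, and the assumption that the Difficulty Estimation Agent outputs a score in [0, 1] are all hypothetical.

```python
import numpy as np

def curriculum_visible_window(step, total_steps, difficulty,
                              gt_start, gt_end, video_len):
    """Return the [lo, hi] span (in seconds) left visible at a given step.

    Hypothetical sketch: `difficulty` in [0, 1] is assumed to come from
    the Difficulty Estimation Agent (1.0 = hardest). At step 0 a maximally
    hard sample shows only its annotated interval [gt_start, gt_end];
    the window expands linearly to the full video [0, video_len] as
    training progresses. Easy samples (difficulty ~0) are never masked.
    """
    progress = min(step / max(total_steps, 1), 1.0)
    # Fraction of context outside the annotated interval that is revealed.
    reveal = progress + (1.0 - difficulty) * (1.0 - progress)
    lo = gt_start * (1.0 - reveal)                # expand left toward 0
    hi = gt_end + (video_len - gt_end) * reveal   # expand right toward end
    return lo, hi

def mask_frames(frames, timestamps, lo, hi):
    """Zero out frames outside the visible window (one plausible masking
    choice). `frames` has shape (T, H, W, C); `timestamps` is per-frame
    time in seconds."""
    t = np.asarray(timestamps, dtype=float)
    visible = (t >= lo) & (t <= hi)
    return frames * visible[:, None, None, None]
```

Under this sketch, a sample with difficulty 0.9 at 25% of training reveals roughly a third of the context outside the annotated interval (reveal = 0.25 + 0.1 x 0.75 = 0.325), while an easy sample with difficulty 0 always sees the full video, which matches the intent of easing grounding for hard samples early on and restoring the original task later.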