Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training for enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement learning algorithm with group-normalized rewards. However, the effectiveness of GRPO for Video Large Language Models (VideoLLMs) remains underexplored. In this paper, we study GRPO and identify two issues that hinder effective learning: (1) reliance on safeguards such as the clipping and min operations, and (2) the vanishing-advantage problem, where near-identical rewards within a group produce advantages close to zero and thus little learning signal. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation. Reg-GRPO reformulates the GRPO objective as a regression task that directly predicts the advantage, eliminating the need for safeguards such as clipping and min operations. This aligns the model directly with the advantage values, guiding it to prefer better outputs. The difficulty-aware data augmentation strategy augments input prompts/videos toward solvable difficulty levels, enabling diverse reward signals. Our experimental results show that our approach significantly improves video reasoning performance across multiple benchmarks.
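The two failure modes above can be made concrete with a small sketch. The first function computes GRPO's group-normalized advantage, which shrinks toward zero when all sampled responses in a group receive the same reward (the vanishing-advantage problem). The second contrasts this with a regression-style objective that fits the policy's log-probability ratio to the advantage target, with no clipping or min safeguards. The exact functional form of Reg-GRPO's loss is not given in the abstract, so the squared-error formulation here is an illustrative assumption, not the paper's definition.

```python
import torch


def group_normalized_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantage: normalize each response's reward within its group.

    If all rewards in the group are (nearly) equal, the numerator is ~0 for
    every response, so the advantage -- and the learning signal -- vanishes.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def regression_style_loss(logp_new: torch.Tensor,
                          logp_old: torch.Tensor,
                          rewards: torch.Tensor) -> torch.Tensor:
    """Illustrative regression objective (an assumption, not the paper's exact loss).

    Instead of PPO's min(clip(ratio) * A, ratio * A) safeguard, regress the
    log-probability ratio toward the group-normalized advantage target.
    """
    adv = group_normalized_advantages(rewards)
    ratio_log = logp_new - logp_old.detach()
    return ((ratio_log - adv.detach()) ** 2).mean()


# Diverse rewards -> nonzero advantages; identical rewards -> advantages ~ 0.
diverse = group_normalized_advantages(torch.tensor([1.0, 0.0, 0.5, 0.5]))
flat = group_normalized_advantages(torch.tensor([1.0, 1.0, 1.0, 1.0]))
```

The `flat` case motivates the difficulty-aware augmentation: by steering prompts/videos toward solvable difficulty levels, the group's rewards become diverse again and the advantage signal is restored.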