Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training for enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement algorithm with group-normalized rewards. However, the effectiveness of GRPO for Video Large Language Models (VideoLLMs) remains understudied. In this paper, we explore GRPO for video and identify two problems that hinder effective learning: (1) reliance on safeguards, and (2) the vanishing advantage problem. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with Reg-GRPO (Regressive GRPO) and a difficulty-aware data augmentation strategy. Reg-GRPO reformulates the GRPO objective as a regression task that directly predicts the advantage, eliminating the need for safeguards such as the clipping and min functions. By aligning the model directly with advantages, it provides explicit guidance toward better responses. The difficulty-aware data augmentation strategy augments input prompts/videos so that samples fall within solvable difficulty levels, yielding diverse reward signals. Our experimental results show that our approach significantly improves video reasoning performance across multiple benchmarks.
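To make the contrast concrete, the sketch below illustrates the two ingredients the abstract refers to: standard GRPO group-normalized advantages with the PPO-style clipped (min/clip) surrogate, and a regression-style objective that fits the policy directly to the advantage. This is a minimal PyTorch-style sketch under our own assumptions; `regression_style_loss` is a hypothetical illustration of the regression idea, not the exact Reg-GRPO loss defined in the paper.

```python
import torch
import torch.nn.functional as F


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-normalized advantages as in GRPO.

    rewards: (num_groups, group_size) rewards for sampled responses,
    standardized within each group of responses to the same prompt.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)


def grpo_clipped_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO-style clipped surrogate used by GRPO (the clip/min safeguards)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


def regression_style_loss(logp_new, logp_old, advantages, beta: float = 1.0):
    """Hypothetical regression-style objective in the spirit of Reg-GRPO:
    regress a scaled log-likelihood ratio onto the group-normalized advantage,
    removing the clip/min safeguards. Not the paper's exact formulation."""
    pred = beta * (logp_new - logp_old)
    return F.mse_loss(pred, advantages)
```

A usage example: with `rewards` of shape `(B, G)` from `G` sampled responses per prompt, one computes `adv = grpo_advantages(rewards)` and feeds it, together with per-response log-probabilities, to either loss; the regression form trades the surrogate objective's safeguards for a direct fit to the advantage signal.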