GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov

From arXiv; NVIDIA Tech Report

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
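
The collapse described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the coupled (GRPO-style) baseline sums the rewards of each rollout and z-scores the totals within the group, and that the decoupled (GDPO-style) variant z-scores each reward within the group before summing. Any further steps the paper may apply are omitted. The function names, the `EPS` constant, and the toy reward values are hypothetical.

```python
from statistics import fmean, pstdev

EPS = 1e-8  # hypothetical small constant to avoid division by zero


def group_normalize(values):
    """Z-score a sequence of scalars within one rollout group."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / (std + EPS) for v in values]


def coupled_advantages(rewards):
    """Assumed GRPO-style baseline: sum rewards per rollout, normalize once."""
    totals = [sum(r) for r in rewards]
    return group_normalize(totals)


def decoupled_advantages(rewards):
    """Assumed GDPO-style variant: normalize each reward separately, then sum."""
    per_reward = [group_normalize(col) for col in zip(*rewards)]
    return [sum(col) for col in zip(*per_reward)]


# Two groups of two rollouts; each rollout is scored by two binary rewards.
group_a = [(0, 0), (1, 0)]  # second rollout satisfies one reward
group_b = [(0, 0), (1, 1)]  # second rollout satisfies both rewards

for name, group in [("A", group_a), ("B", group_b)]:
    print(name, coupled_advantages(group), decoupled_advantages(group))
```

Under these assumptions, the coupled version assigns both groups approximately (-1, 1), so a rollout that satisfies both rewards gets the same training signal as one that satisfies only one. The decoupled version yields approximately (-1, 1) for group A and (-2, 2) for group B, which keeps the two cases distinguishable.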
