Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning large language models (LLMs) with human values and preferences. While state-of-the-art applications like ChatGPT/GPT-4 commonly employ Proximal Policy Optimization (PPO), the inclusion of a critic network introduces significant computational overhead. REINFORCE-based methods, such as REINFORCE Leave One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO), address this limitation by eliminating the critic network. However, these approaches face challenges in accurate advantage estimation. Specifically, they estimate advantages independently for responses to each prompt, which can lead to overfitting on simpler prompts and vulnerability to reward hacking. To address these challenges, we introduce REINFORCE++, a novel approach that removes the critic model while using the normalized reward of a batch as the baseline. Our empirical evaluation demonstrates that REINFORCE++ exhibits robust performance across various reward models without requiring prompt set truncation. Furthermore, it achieves superior generalization in both RLHF and long chain-of-thought (CoT) settings compared to existing REINFORCE-based methods. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.
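The core distinction above can be sketched in a few lines: per-prompt methods such as GRPO normalize rewards within each prompt's group of responses, whereas REINFORCE++ normalizes against the whole batch. The snippet below is a minimal illustration of that difference, not the paper's implementation; function names and the epsilon constant are our own.

```python
import numpy as np

def batch_normalized_advantages(rewards):
    """Global-baseline sketch: normalize each reward against the mean
    and std of ALL rewards in the batch, with no per-prompt grouping."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def per_prompt_advantages(rewards, prompt_ids):
    """Per-prompt contrast (GRPO-style): normalize within each prompt's
    group of responses. On an easy prompt where every response scores
    high, the group mean absorbs the signal, which can encourage
    overfitting to such prompts."""
    rewards = np.asarray(rewards, dtype=np.float64)
    prompt_ids = np.asarray(prompt_ids)
    advantages = np.empty_like(rewards)
    for pid in np.unique(prompt_ids):
        mask = prompt_ids == pid
        group = rewards[mask]
        advantages[mask] = (group - group.mean()) / (group.std() + 1e-8)
    return advantages
```

With a batch containing one easy prompt (rewards 0.9, 0.95) and one hard prompt (0.1, 0.2), batch normalization keeps both easy-prompt responses above the baseline, while per-prompt normalization forces each group to zero mean regardless of absolute reward.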