AI alignment in the form of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high-performance large language models. Recent literature has positioned \textsc{Proximal Policy Optimization} (PPO) as the canonical method for the RL part of RLHF. However, PPO involves both high computational cost and sensitive hyperparameter tuning. We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF, and we advocate for a less computationally expensive method that preserves and even increases performance. We revisit the \textit{formulation} of alignment from human preferences in the context of RL. Keeping simplicity as a guiding principle, we show that many components of PPO are unnecessary in an RLHF context and that far simpler REINFORCE-style optimization variants outperform both PPO and newly proposed ``RL-free'' methods such as DPO and RAFT. Our work suggests that careful adaptation to the alignment characteristics of LLMs makes it possible to benefit from online RL optimization at low cost.