Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models (LLMs) with human preferences. Among the many RLHF techniques, proximal policy optimization (PPO) is one of the most widely used. Despite its popularity, however, PPO may suffer from mode collapse, instability, and poor sample efficiency. We show that these issues can be alleviated by a novel algorithm that we refer to as Advantage-Induced Policy Alignment (APA), which leverages a squared error loss function based on the estimated advantages. We demonstrate empirically that APA consistently outperforms PPO in language tasks by a large margin when a separate reward model is employed as the evaluator. Compared with PPO, APA also offers a more stable form of control over the deviation from the model's initial policy, ensuring that the model improves its performance without collapsing to deterministic output. Beyond these empirical results, we provide a theoretical justification supporting the design of our loss function.