Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models with human preferences, significantly enhancing the quality of interactions between humans and these models. InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO). PPO, however, is sensitive to hyperparameters and requires a minimum of four models in its standard implementation, which makes it hard to train. In contrast, we propose a novel learning paradigm called RRHF, which scores responses generated by different sampling policies and learns to align them with human preferences through a ranking loss. RRHF can align language model output probabilities with human preferences as robustly as fine-tuning, and it needs only one to two models during tuning. In addition, RRHF can be considered an extension of SFT and reward models while being simpler than PPO in terms of coding, model counts, and hyperparameters. The entire alignment process can be accomplished within a single RRHF training session. We evaluate RRHF using LLaMA and Alpaca on Helpful and Harmless data, demonstrating performance comparable to PPO.
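The abstract does not state the exact form of the ranking loss, so the following is only a minimal sketch of the general idea, under stated assumptions: each response is scored by its length-normalized log-probability under the model, a pairwise hinge penalty is applied whenever a lower-reward response outscores a higher-reward one, and an SFT-style negative log-likelihood term is added for the best-rewarded response. All function names (`length_normalized_score`, `ranking_loss`, `sft_loss`, `rrhf_style_loss`) are hypothetical, and the inputs are plain lists of per-token log-probabilities rather than real model outputs.

```python
from itertools import combinations


def length_normalized_score(token_logprobs):
    """Average per-token log-probability of one response (assumed scoring rule)."""
    return sum(token_logprobs) / len(token_logprobs)


def ranking_loss(scores, rewards):
    """Pairwise hinge penalty (assumed form): for each pair, penalize the model
    when the lower-reward response receives the higher score."""
    loss = 0.0
    for i, j in combinations(range(len(scores)), 2):
        if rewards[i] < rewards[j]:
            loss += max(0.0, scores[i] - scores[j])
        elif rewards[j] < rewards[i]:
            loss += max(0.0, scores[j] - scores[i])
    return loss


def sft_loss(best_token_logprobs):
    """Negative log-likelihood of the highest-reward response (SFT-like term)."""
    return -sum(best_token_logprobs)


def rrhf_style_loss(all_token_logprobs, rewards):
    """Ranking term plus SFT-like term; the equal weighting is an assumption."""
    scores = [length_normalized_score(lp) for lp in all_token_logprobs]
    best = max(range(len(rewards)), key=lambda k: rewards[k])
    return ranking_loss(scores, rewards) + sft_loss(all_token_logprobs[best])


if __name__ == "__main__":
    # Two candidate responses to one prompt: made-up per-token log-probs and rewards.
    candidates = [[-0.5, -0.5], [-1.0, -3.0]]
    rewards = [0.2, 0.9]  # the second response is preferred
    print(rrhf_style_loss(candidates, rewards))
```

In this toy input the model assigns the higher score to the less-preferred response, so the ranking term is non-zero; swapping the rewards makes the ranking term vanish and leaves only the SFT-like term. Because the sketch only needs log-probabilities from the model being tuned plus externally supplied rewards, it is consistent with the abstract's claim that tuning needs only one to two models.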