Large language models (LLMs) have laid out a blueprint for the advancement of artificial general intelligence. Their primary objective is to function as human-centric (helpful, honest, and harmless) assistants. Alignment with humans is therefore of paramount importance, and reinforcement learning with human feedback (RLHF) has emerged as the pivotal technological paradigm underpinning this pursuit. Current technical routes usually include \textbf{reward models} to measure human preferences, \textbf{Proximal Policy Optimization} (PPO) to optimize policy model outputs, and \textbf{process supervision} to improve step-by-step reasoning capabilities. However, the challenges of reward design, environment interaction, and agent training, coupled with the huge trial-and-error cost of large language models, pose a significant barrier for AI researchers working on the technical alignment and safe deployment of LLMs. Stable RLHF training remains an open problem. In this first report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the components of the PPO algorithm affect policy agent training. We identify policy constraints as the key factor for the effective implementation of the PPO algorithm. We therefore explore PPO-max, an advanced version of the PPO algorithm, to efficiently improve the training stability of the policy model. Based on our main results, we perform a comprehensive analysis of RLHF abilities compared with SFT models and ChatGPT. The absence of open-source implementations has posed significant challenges to the investigation of LLM alignment. We are therefore eager to release technical reports, reward models, and PPO code, aiming to make modest contributions to the advancement of LLMs.