Secrets of RLHF in Large Language Models Part I: PPO

Rui Zheng, Shihan Dou, Songyang Gao, Yuan Hua, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Yuhao Zhou, Limao Xiong, Lu Chen, Zhiheng Xi, Nuo Xu, Wenbin Lai, Minghao Zhu, Cheng Chang, Zhangyue Yin, Rongxiang Weng, Wensen Cheng, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang

Large language models (LLMs) have laid out a blueprint for the advancement of artificial general intelligence. Their primary objective is to function as human-centric (helpful, honest, and harmless) assistants. Alignment with humans is therefore of paramount importance, and reinforcement learning with human feedback (RLHF) has emerged as the pivotal technical paradigm underpinning this pursuit. Current technical routes usually include reward models to measure human preferences, Proximal Policy Optimization (PPO) to optimize policy model outputs, and process supervision to improve step-by-step reasoning capabilities. However, the challenges of reward design, environment interaction, and agent training, coupled with the huge trial-and-error cost of large language models, pose a significant barrier for AI researchers seeking to advance technical alignment and the safe deployment of LLMs. Stable RLHF training remains an open problem. In this first report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the components of the PPO algorithm affect policy agent training. We identify policy constraints as the key factor for the effective implementation of the PPO algorithm. We therefore explore PPO-max, an advanced version of the PPO algorithm, to efficiently improve the training stability of the policy model. Based on our main results, we perform a comprehensive analysis of RLHF abilities compared with SFT models and ChatGPT. The absence of open-source implementations has posed significant challenges to the investigation of LLM alignment. We are therefore eager to release technical reports, reward models, and PPO code, aiming to make a modest contribution to the advancement of LLMs.

