Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool for deploying the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF -- both in recent literature and in a convergence of disparate scientific fields: economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common mathematics used in the literature. The core of the book details every optimization stage in using RLHF, starting with instruction tuning, then training a reward model, and finally covering rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics -- understudied research questions in synthetic data and evaluation -- and open questions for the field.