Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool for deploying the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF -- both in recent literature and in a convergence of disparate fields of science spanning economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common mathematics used in the literature. The core of the book details every optimization stage of RLHF, from instruction tuning, to training a reward model, and finally to rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics -- understudied research questions in synthetic data and evaluation -- and open questions for the field.