This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We first identify a primary challenge of existing popular methods such as offline PPO and offline DPO: the lack of strategic exploration of the environment. Then, to understand the mathematical principles of RLHF, we consider a standard formulation, the reverse-KL regularized contextual bandit. Despite its widespread practical application, a rigorous theoretical analysis of this formulation remains open. We investigate its behavior in three distinct settings -- offline, online, and hybrid -- and propose efficient algorithms with finite-sample theoretical guarantees. Moving toward practical applications, our framework, equipped with a robust approximation of the information-theoretic policy improvement oracle, naturally gives rise to several novel RLHF algorithms, including an iterative version of the Direct Preference Optimization (DPO) algorithm for online settings and a multi-step rejection sampling strategy for offline scenarios. Our empirical evaluations on real-world large language model alignment experiments demonstrate that the proposed methods significantly surpass strong existing baselines such as DPO and Rejection Sampling Optimization (RSO), illustrating the connection between solid theoretical foundations and potent practical implementations.
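For concreteness, the reverse-KL regularized contextual bandit objective referenced above is commonly written as follows; the notation here ($d_0$ for the prompt distribution, $\pi_0$ for the reference policy, $r$ for the reward function, and $\eta > 0$ for the regularization coefficient) is our own shorthand rather than the abstract's:
\[
\max_{\pi} \; \mathbb{E}_{x \sim d_0,\, a \sim \pi(\cdot \mid x)}\big[r(x, a)\big] \;-\; \eta\, \mathbb{E}_{x \sim d_0}\big[\mathrm{KL}\big(\pi(\cdot \mid x)\,\big\|\,\pi_0(\cdot \mid x)\big)\big].
\]
Its maximizer admits the closed-form Gibbs distribution $\pi_r(a \mid x) \propto \pi_0(a \mid x)\exp\!\big(r(x, a)/\eta\big)$, which, under this formulation, plays the role of the information-theoretic policy improvement oracle that the proposed algorithms approximate.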