This paper studies the theoretical framework for aligning generative models with Reinforcement Learning from Human Feedback (RLHF). We consider a standard mathematical formulation, the reverse-KL regularized contextual bandit for RLHF. Despite its widespread practical application, a rigorous theoretical analysis of this formulation remains open. We investigate its behavior in three distinct settings (offline, online, and hybrid) and propose efficient algorithms with finite-sample theoretical guarantees. Moving towards practical applications, our framework, with a robust approximation of the information-theoretical policy improvement oracle, naturally gives rise to several novel RLHF algorithms. These include an iterative version of the Direct Preference Optimization (DPO) algorithm for online settings, and a multi-step rejection sampling strategy for offline scenarios. Our empirical evaluations on real-world alignment experiments with large language models demonstrate that the proposed methods significantly surpass existing strong baselines, such as DPO and Rejection Sampling Optimization (RSO), showcasing the connection between solid theoretical foundations and their practical implementations.