This paper studies the theoretical framework of the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We consider a standard mathematical formulation, the reverse-KL regularized contextual bandit for RLHF. Despite its widespread practical application, a rigorous theoretical analysis of this formulation remains open. We investigate its behavior in three distinct settings -- offline, online, and hybrid -- and propose efficient algorithms with finite-sample theoretical guarantees. Moving towards practical applications, our framework, with a robust approximation of the information-theoretical policy improvement oracle, naturally gives rise to several novel RLHF algorithms. These include an iterative version of the Direct Preference Optimization (DPO) algorithm for online settings, and a multi-step rejection sampling strategy for offline scenarios. Our empirical evaluations on real-world alignment experiments with large language models demonstrate that the proposed methods significantly surpass existing strong baselines, such as DPO and Rejection Sampling Optimization (RSO), showcasing the connection between solid theoretical foundations and their powerful practical implementations.
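For readers unfamiliar with the formulation, the reverse-KL regularized contextual bandit objective is commonly written as in the sketch below. The notation is an assumption made here for illustration and does not come from the abstract: $x$ is a prompt drawn from a distribution $d_0$, $a$ is a response, $r$ is the reward, $\pi_0$ is the reference (initial) policy, and $\eta > 0$ is the KL coefficient.

```latex
% Reverse-KL regularized objective (notation assumed for illustration)
\[
  J(\pi) \;=\; \mathbb{E}_{x \sim d_0}\Big[
      \mathbb{E}_{a \sim \pi(\cdot \mid x)}\big[ r(x, a) \big]
      \;-\; \eta \, \mathrm{KL}\big( \pi(\cdot \mid x) \,\big\|\, \pi_0(\cdot \mid x) \big)
  \Big]
\]

% For a known reward r, the maximizer is the Gibbs (exponentially tilted) policy
\[
  \pi^{*}(a \mid x) \;=\; \frac{1}{Z(x)} \, \pi_0(a \mid x) \, \exp\!\Big( \frac{r(x, a)}{\eta} \Big),
  \qquad
  Z(x) \;=\; \sum_{a'} \pi_0(a' \mid x) \, \exp\!\Big( \frac{r(x, a')}{\eta} \Big)
\]
```

Under these assumptions, both families of methods named in the abstract can be read as ways of approximating this tilted policy: DPO-style methods fit it from preference data without an explicit reward model, while rejection sampling draws proposals from $\pi_0$ and accepts them with probability proportional to $\exp(r(x, a)/\eta)$.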