This paper studies the theoretical foundations of aligning generative models with Reinforcement Learning from Human Feedback (RLHF). We consider a standard mathematical formulation of RLHF, the reverse-KL regularized contextual bandit. Although this formulation is widely used in practice, a rigorous theoretical analysis of it remains open. We investigate its theoretical properties in both the offline and online settings and propose efficient algorithms with finite-sample guarantees. Our work bridges the gap between theory and practice by connecting these theoretical insights to existing practical alignment algorithms such as Direct Preference Optimization (DPO) and Rejection Sampling Optimization (RSO). These findings and connections also offer both the theoretical and practical communities new tools and insights for the design of future alignment algorithms.
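For readers unfamiliar with the reverse-KL regularized contextual bandit, the following is a minimal sketch of the objective as it is commonly written, not the algorithms proposed in the paper. It assumes a single context, a finite action set, a known reward r(a), a reference policy pi_0, and a regularization weight eta > 0, none of which are specified in the abstract. Under these assumptions the objective is E_{a ~ pi}[r(a)] - eta * KL(pi || pi_0), and its maximizer is the Gibbs policy pi(a) proportional to pi_0(a) * exp(r(a) / eta). The function names and the numbers are hypothetical and chosen only for illustration.

```python
import math


def gibbs_policy(ref_probs, rewards, eta):
    """Return pi(a) proportional to ref_probs[a] * exp(rewards[a] / eta)."""
    scaled = [r / eta for r in rewards]
    # Subtract the largest scaled reward before exponentiating for stability.
    shift = max(scaled)
    weights = [q * math.exp(s - shift) for q, s in zip(ref_probs, scaled)]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


def kl_regularized_objective(policy, ref_probs, rewards, eta):
    """Expected reward minus eta * KL(policy || ref_probs) for one context."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    # Terms with p == 0 contribute nothing to the KL divergence.
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs) if p > 0)
    return expected_reward - eta * kl


# Hypothetical toy instance: three actions, illustrative numbers only.
ref_probs = [0.5, 0.3, 0.2]   # assumed reference policy pi_0
rewards = [0.0, 1.0, 2.0]     # assumed known rewards r(a)
eta = 1.0                     # assumed regularization weight

pi_star = gibbs_policy(ref_probs, rewards, eta)
best_value = kl_regularized_objective(pi_star, ref_probs, rewards, eta)

# Two alternative policies for comparison: staying at pi_0, and acting greedily.
value_at_reference = kl_regularized_objective(ref_probs, ref_probs, rewards, eta)
value_greedy = kl_regularized_objective([0.0, 0.0, 1.0], ref_probs, rewards, eta)
```

In this toy instance the Gibbs policy scores higher than both the reference policy and the greedy policy, which illustrates the trade-off between reward and staying close to pi_0. In RLHF the reward is not known and must be learned from preference data, which is where the offline and online analyses come in.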