Reinforcement Learning from Human Feedback (RLHF) plays a significant role in aligning Large Language Models (LLMs) with human preferences. While RLHF with expected reward constraints can be formulated as a primal-dual optimization problem, standard primal-dual methods guarantee convergence only for distributional policies, where the saddle-point problem is convex-concave. Moreover, under the policy parameterization used in practice, the last iterate of these methods may be unstable or diverge. In this work, we propose a universal primal-dual framework for safe RLHF that unifies a broad class of existing alignment algorithms, including safe-RLHF, one-shot, and multi-shot methods. Building on this framework, we introduce an optimistic primal-dual (OPD) algorithm that incorporates predictive updates for both primal and dual variables to stabilize the saddle-point dynamics. We establish last-iterate convergence guarantees for the proposed method in two settings: exact policy optimization in the distributional space, and parameterized policies, where the iterates converge to a neighborhood of the optimal solution whose gap is related to the approximation error and bias. Our analysis reveals that optimism plays a crucial role in mitigating the oscillations inherent to constrained alignment objectives, thereby closing a key theoretical gap between constrained RL and practical RLHF.
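To illustrate why predictive (optimistic) updates can stabilize the last iterate of saddle-point dynamics, the following minimal sketch compares plain gradient descent-ascent with its optimistic variant. It is not the OPD algorithm itself: the toy objective f(x, y) = x * y, the step size, and all function names are assumptions made for illustration, with x standing in for the primal (policy) variable and y for the dual (multiplier) variable.

```python
import math


def grad(x, y):
    """Gradients of the toy saddle function f(x, y) = x * y (an assumed stand-in)."""
    return y, x  # (df/dx, df/dy)


def gda(x, y, eta, steps):
    """Plain simultaneous gradient descent (in x) and ascent (in y)."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - eta * gx, y + eta * gy
    return x, y


def optimistic_gda(x, y, eta, steps):
    """Optimistic variant: each step uses the extrapolated gradient 2*g_t - g_{t-1}."""
    prev_gx, prev_gy = grad(x, y)  # first step reduces to a plain GDA step
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = (
            x - eta * (2 * gx - prev_gx),
            y + eta * (2 * gy - prev_gy),
        )
        prev_gx, prev_gy = gx, gy
    return x, y


def distance_to_saddle(point):
    """Euclidean distance of the last iterate from the unique saddle point (0, 0)."""
    return math.hypot(*point)


if __name__ == "__main__":
    start, eta, steps = (1.0, 1.0), 0.2, 1000
    print("GDA last-iterate distance:       ", distance_to_saddle(gda(*start, eta, steps)))
    print("Optimistic last-iterate distance:", distance_to_saddle(optimistic_gda(*start, eta, steps)))
```

On this bilinear toy problem the plain iterates spiral away from the saddle point, while the optimistic iterates contract toward it. The sketch covers only the unconstrained, exactly differentiable case and says nothing about the parameterized-policy setting analyzed in the paper.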