The growing safety concerns surrounding Large Language Models (LLMs) raise an urgent need to align them with diverse human preferences to simultaneously enhance their helpfulness and safety. A promising approach is to enforce safety constraints through Reinforcement Learning from Human Feedback (RLHF). For such constrained RLHF, common Lagrangian-based primal-dual policy optimization methods are computationally expensive and often unstable. This paper presents a dualization perspective that reduces constrained alignment to an equivalent unconstrained alignment problem. We do so by pre-optimizing a smooth and convex dual function that has a closed form. This shortcut eliminates the need for cumbersome primal-dual policy iterations, thus greatly reducing the computational burden and improving training stability. Our strategy leads to two practical algorithms in model-based and preference-based scenarios (MoCAN and PeCAN, respectively). Experiments across a broad range of settings demonstrate the effectiveness of our methods.
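To make the dualization idea concrete, the sketch below shows one plausible instantiation in the model-based setting: assuming a KL-regularized reward-maximization objective with a single safety constraint, the dual is a smooth convex function of a scalar multiplier that can be pre-optimized from reference-policy samples alone, with no policy updates. This is a minimal illustration under those assumptions, not the paper's exact formulation or API; the function names (`dual_value`, `preoptimize_lambda`), the hyperparameters `beta` and `b`, and the toy data are all hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative sketch of dual pre-optimization for constrained, KL-regularized
# alignment (assumed setup, not the paper's exact method).
#
# Setup: for each prompt we have K responses sampled from the reference policy,
# with helpfulness rewards r[i, k] and safety scores g[i, k]. beta is the
# KL-regularization strength, b the safety threshold.

def dual_value(lam, r, g, beta, b):
    """Dual objective D(lam) under the assumed KL-regularized formulation.

    For a fixed multiplier lam >= 0 the inner maximization over policies has a
    closed form, so D reduces to a smooth convex function of a single scalar:
    D(lam) = beta * E_x[ log E_{y~pi_ref}[ exp((r + lam*g)/beta) ] ] - lam * b.
    """
    z = (r + lam * g) / beta                       # combined (scaled) reward
    zmax = z.max(axis=1, keepdims=True)            # stabilize the log-mean-exp
    log_mean_exp = np.log(np.mean(np.exp(z - zmax), axis=1)) + z.max(axis=1)
    return beta * np.mean(log_mean_exp) - lam * b

def preoptimize_lambda(r, g, beta, b, lam_max=100.0):
    """Minimize the one-dimensional convex dual over lam >= 0."""
    res = minimize_scalar(lambda lam: dual_value(lam, r, g, beta, b),
                          bounds=(0.0, lam_max), method="bounded")
    return res.x

# Toy usage: 512 prompts, 8 reference samples each, random scores.
rng = np.random.default_rng(0)
r = rng.normal(size=(512, 8))
g = rng.normal(size=(512, 8))
lam_star = preoptimize_lambda(r, g, beta=0.1, b=0.0)
print(f"pre-optimized dual variable: {lam_star:.3f}")
```

Under these assumptions, the pre-optimized multiplier folds the safety score into a single combined reward r + lam_star * g, so one pass of any standard unconstrained alignment method can then be run in place of primal-dual policy iterations, consistent with the shortcut the abstract describes.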