We study the problem of computing an optimal large language model (LLM) policy for the constrained alignment problem, where the goal is to maximize a primary reward objective while satisfying constraints on secondary utilities. Despite the popularity of Lagrangian-based LLM policy search in constrained alignment, iterative primal-dual methods often fail to converge, and non-iterative dual-based methods do not achieve optimality in the LLM parameter space. To address these challenges, we employ Lagrangian duality to develop an iterative dual-based alignment method that alternates between updating the LLM policy via Lagrangian maximization and updating the dual variable via dual descent. In theory, we characterize the primal-dual gap between the primal value in the distribution space and the dual value in the LLM parameter space. We further quantify the optimality gap of the learned LLM policies at near-optimal dual variables with respect to both the objective and the constraint functions. These results prove that dual-based alignment methods can find an optimal constrained LLM policy, up to an LLM parametrization gap. We demonstrate the effectiveness and merits of our approach through extensive experiments conducted on the PKU-SafeRLHF and Anthropic HH-RLHF datasets.
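To make the alternating scheme concrete, below is a minimal, self-contained sketch of the iterative dual-based loop described above: repeat (i) Lagrangian maximization over the policy and (ii) dual descent on the multiplier. The toy setup (a tabular softmax "policy" over a handful of candidate responses, the reward/utility arrays, the threshold b, and the step sizes) is an illustrative assumption, not the paper's actual implementation or hyperparameters.

```python
# Sketch of iterative dual-based alignment: alternate between
# (i) maximizing the Lagrangian over the policy and (ii) dual descent.
# Everything below is a toy stand-in for an LLM policy, for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a single prompt with K candidate responses.
K = 8
reward = rng.normal(size=K)      # primary reward r(y)
utility = rng.normal(size=K)     # secondary utility g(y), e.g. a safety score
b = 0.2                          # constraint threshold: E_pi[g] >= b

theta = np.zeros(K)              # logits of a tabular softmax policy

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def lagrangian_grad(theta, lam):
    """Exact gradient of the Lagrangian E_pi[r + lam * (g - b)] for a softmax policy."""
    pi = policy(theta)
    combined = reward + lam * (utility - b)
    baseline = pi @ combined
    return pi * (combined - baseline)

lam = 0.0                        # dual variable (Lagrange multiplier)
eta_theta, eta_lam = 0.5, 0.1    # illustrative step sizes

for t in range(500):
    # (i) Lagrangian maximization: a few ascent steps on the policy.
    for _ in range(10):
        theta += eta_theta * lagrangian_grad(theta, lam)
    # (ii) Dual descent: projected gradient step on the multiplier.
    pi = policy(theta)
    constraint_gap = pi @ utility - b
    lam = max(0.0, lam - eta_lam * constraint_gap)

pi = policy(theta)
print(f"E[reward]  = {pi @ reward:.3f}")
print(f"E[utility] = {pi @ utility:.3f} (threshold {b})")
print(f"lambda     = {lam:.3f}")
```

In this sketch, a violated constraint (expected utility below b) makes the multiplier grow, which tilts the combined objective toward the secondary utility; once the constraint is satisfied, the multiplier shrinks toward zero and the policy update is driven mainly by the primary reward.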