Aligning language models with human preferences is vital for their application in real-world tasks. The problem is formulated as optimizing the model's policy to maximize the expected reward that reflects human preferences while deviating minimally from the initial policy. Although reinforcement learning (RL) is regarded as a straightforward solution, it suffers from high variance in policy updates, which impedes efficient policy improvement. Recently, direct preference optimization (DPO) was proposed to optimize the policy directly from preference data. Though simple to implement, DPO is derived from the optimal policy, which is not guaranteed to be attained in practice, and this undermines its convergence to the intended solution. In this paper, we propose efficient exact optimization (EXO) of the alignment objective. We prove that EXO is guaranteed to optimize in the same direction as the RL algorithms asymptotically for arbitrary parametrization of the policy, while enabling efficient optimization by circumventing the complexities associated with RL algorithms. We compare our method to DPO with both theoretical and empirical analyses, and further demonstrate the advantages of our method over existing approaches on realistic human preference data. Code is available at https://github.com/haozheji/exact-optimization.
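For concreteness, the alignment objective described above is commonly written as the expected reward minus a KL penalty toward the initial policy, whose maximizer has the closed form pi*(y) proportional to pi_ref(y) exp(r(y) / beta). The abstract does not state this formula, so the sketch below assumes that standard form on a toy set of three responses; the names `alignment_objective`, `closed_form_optimum`, `pi_ref`, `reward`, and `beta` are illustrative and are not taken from the released code.

```python
import numpy as np

def alignment_objective(pi, pi_ref, reward, beta):
    """Expected reward minus beta * KL(pi || pi_ref) over a finite response set.

    Assumes every entry of pi and pi_ref is strictly positive.
    """
    pi = np.asarray(pi, dtype=float)
    kl = np.sum(pi * (np.log(pi) - np.log(pi_ref)))
    return float(np.sum(pi * reward) - beta * kl)

def closed_form_optimum(pi_ref, reward, beta):
    """Return pi*(y) proportional to pi_ref(y) * exp(reward(y) / beta)."""
    logits = np.log(pi_ref) + reward / beta
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Toy setup (hypothetical values): three candidate responses to one prompt.
pi_ref = np.array([0.5, 0.3, 0.2])   # initial policy
reward = np.array([0.0, 1.0, 2.0])   # reward reflecting preferences
beta = 0.5                           # strength of the KL penalty

pi_star = closed_form_optimum(pi_ref, reward, beta)
print("optimal policy:", pi_star)
print("objective at pi_ref :", alignment_objective(pi_ref, pi_ref, reward, beta))
print("objective at pi_star:", alignment_objective(pi_star, pi_ref, reward, beta))
```

In this toy setting the optimum can be computed exactly because the response set is enumerable. For a language model the normalizer ranges over all sequences and the policy is restricted to a parametrized family, which is why the optimal policy may not be attained in practice.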