Aligning language models with human preferences is vital for their application in real-world tasks. The problem is formulated as optimizing the model's policy to maximize the expected reward that reflects human preferences while deviating minimally from the initial policy. Although often regarded as a straightforward solution, reinforcement learning (RL) suffers from high variance in policy updates, which impedes efficient policy improvement. Recently, direct preference optimization (DPO) was proposed to optimize the policy directly from preference data. Though simple to implement, DPO is derived from an optimal policy that is not guaranteed to be attained in practice, which undermines its convergence to the intended solution. In this paper, we propose efficient exact optimization (EXO) of the alignment objective. We prove that EXO is guaranteed to optimize in the same direction as the RL algorithms asymptotically for arbitrary parametrization of the policy, while enabling efficient optimization by circumventing the complexities associated with RL algorithms. We compare our method to DPO with both theoretical and empirical analyses, and further demonstrate the advantages of our method over existing approaches on realistic human preference data.
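The alignment objective described above can be illustrated on a toy problem. The sketch below is not the EXO method itself. It assumes that "deviation from the initial policy" is measured by a KL divergence weighted by a coefficient `beta`, which is the common choice but is not stated in the abstract, and it uses a hypothetical finite set of three candidate responses with made-up rewards. Under those assumptions the maximizer has the well-known closed form in which each response's initial probability is reweighted by `exp(reward / beta)` and renormalized. All function and variable names are illustrative.

```python
import math


def kl_regularized_objective(policy, ref_policy, rewards, beta):
    """Expected reward minus beta * KL(policy || ref_policy) on a finite response set."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    # Terms with p == 0 contribute nothing to the KL divergence.
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_policy) if p > 0)
    return expected_reward - beta * kl


def optimal_policy(ref_policy, rewards, beta):
    """Closed-form maximizer: pi*(y) is proportional to ref(y) * exp(r(y) / beta)."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref_policy, rewards)]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


# Hypothetical toy setup: three candidate responses (assumed values).
ref_policy = [0.5, 0.3, 0.2]   # initial policy
rewards = [0.0, 1.0, 2.0]      # reward reflecting preferences
beta = 1.0                     # strength of the KL penalty

pi_star = optimal_policy(ref_policy, rewards, beta)
best_value = kl_regularized_objective(pi_star, ref_policy, rewards, beta)
ref_value = kl_regularized_objective(ref_policy, ref_policy, rewards, beta)
```

In this toy setting `pi_star` shifts probability mass toward higher-reward responses, and a larger `beta` keeps it closer to `ref_policy`. For a language model the response space is far too large to enumerate, so the closed form cannot be evaluated directly, which is why RL, DPO, and EXO each optimize the objective in a different way.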