The increasing capabilities of large language models (LLMs) create opportunities for artificial general intelligence but also amplify safety concerns, such as potential misuse of AI systems, making effective AI alignment necessary. Reinforcement Learning from Human Feedback (RLHF) has emerged as a promising pathway towards AI alignment, but it is complex and depends on a separate reward model. Direct Preference Optimization (DPO) has been proposed as an alternative that remains equivalent to RLHF under the reverse KL regularization constraint. This paper presents $f$-DPO, a generalization of DPO that incorporates diverse divergence constraints. We show that under certain $f$-divergences, including the Jensen-Shannon divergence, the forward KL divergence, and $\alpha$-divergences, the complex relationship between the reward and the optimal policy can also be simplified by addressing the Karush-Kuhn-Tucker conditions. This eliminates the need to estimate the normalizing constant in the Bradley-Terry model and enables a tractable mapping between the reward function and the optimal policy. Our approach optimizes LLMs to align with human preferences efficiently and in a supervised manner under a broad set of divergence constraints. Empirically, adopting these divergences yields a balance between alignment performance and generation diversity. Importantly, $f$-DPO outperforms PPO-based methods in divergence efficiency, and divergence constraints directly influence expected calibration error (ECE).
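To make the reward-to-policy mapping concrete, the following is a minimal sketch of a pairwise loss of the kind the abstract describes. It is not taken from the paper. It assumes that the reward can be written as $\beta f'(\pi/\pi_{\mathrm{ref}})$ plus a term that cancels in the Bradley-Terry comparison, and it uses the derivatives of the usual generator functions for each divergence. The names `f_prime` and `f_dpo_loss`, the default `beta`, and the restriction of `alpha` to $(0, 1)$ are illustrative assumptions.

```python
import math


def f_prime(u, divergence="reverse_kl", alpha=0.5):
    """Derivative f'(u) of the generator f at density ratio u = pi / pi_ref.

    Generators assumed here (common conventions, not quoted from the abstract):
      reverse_kl: f(u) = u log u                          -> f'(u) = log u + 1
      forward_kl: f(u) = -log u                           -> f'(u) = -1 / u
      js:         f(u) = u log u - (u + 1) log((u + 1)/2) -> f'(u) = log(2u / (1 + u))
      alpha:      f(u) = (u^(1-a) - (1-a) u - a) / (a (a-1)) -> f'(u) = (1 - u^(-a)) / a
    """
    if u <= 0.0:
        raise ValueError("density ratio must be positive")
    if divergence == "reverse_kl":
        return math.log(u) + 1.0
    if divergence == "forward_kl":
        return -1.0 / u
    if divergence == "js":
        return math.log(2.0 * u / (1.0 + u))
    if divergence == "alpha":
        if not 0.0 < alpha < 1.0:
            raise ValueError("this sketch assumes 0 < alpha < 1")
        return (1.0 - u ** (-alpha)) / alpha
    raise ValueError(f"unknown divergence: {divergence}")


def f_dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l,
               beta=0.1, divergence="reverse_kl", alpha=0.5):
    """Pairwise loss -log sigmoid(beta * (f'(u_w) - f'(u_l))) for one preference pair.

    logp_* are sequence log-probabilities of the preferred (w) and
    dispreferred (l) responses under the policy and the reference model.
    """
    # Density ratios pi / pi_ref; exp() can overflow for extreme gaps,
    # which a production implementation would need to guard against.
    u_w = math.exp(logp_w - logp_ref_w)
    u_l = math.exp(logp_l - logp_ref_l)
    margin = beta * (f_prime(u_w, divergence, alpha) - f_prime(u_l, divergence, alpha))
    # Numerically stable -log(sigmoid(margin)) = softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

With `divergence="reverse_kl"` the constant in $f'$ cancels and the margin reduces to the difference of log-ratios, so the sketch coincides with the standard DPO loss. The other options only change how the density ratio is transformed before the comparison.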