The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations. The first assumes that pairwise preferences can be substituted with pointwise rewards. The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direct Preference Optimisation (DPO) has been proposed as an approach that bypasses the second approximation and learns a policy directly from collected data, without the reward modelling stage. However, this method still relies heavily on the first approximation. In this paper, we try to gain a deeper theoretical understanding of these practical algorithms. In particular, we derive a new general objective called $\Psi$PO for learning from human preferences that is expressed in terms of pairwise preferences and therefore bypasses both approximations. This new general objective allows us to perform an in-depth analysis of the behavior of RLHF and DPO (as special cases of $\Psi$PO) and to identify their potential pitfalls. We then consider another special case of $\Psi$PO by setting $\Psi$ simply to the identity, for which we can derive an efficient optimisation procedure, prove performance guarantees, and demonstrate its empirical superiority to DPO on some illustrative examples.
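To make the contrast between DPO and the identity special case more concrete, the following is a minimal sketch of the two per-pair losses in the form they are usually written. The abstract itself states no formulas, so everything here is an assumption to be checked against the body of the paper: the function names, the regularisation parameter `tau`, and the quantity `h`, taken to be the log-likelihood-ratio margin between the preferred response `y_w` and the dispreferred response `y_l` relative to a reference policy.

```python
import math


def log_ratio_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Assumed margin h = log[pi(y_w)/pi_ref(y_w)] - log[pi(y_l)/pi_ref(y_l)]."""
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def dpo_loss(h, tau):
    """Assumed DPO per-pair loss: -log(sigmoid(tau * h)).

    Computed as a numerically stable softplus of -tau * h. The loss keeps
    decreasing as h grows, so nothing stops the margin from diverging.
    """
    z = tau * h
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def identity_psi_loss(h, tau):
    """Assumed per-pair loss for Psi = identity: (h - 1 / (2 * tau)) ** 2.

    The squared error pulls the margin towards the finite target
    1 / (2 * tau) instead of pushing it to infinity.
    """
    return (h - 1.0 / (2.0 * tau)) ** 2


if __name__ == "__main__":
    tau = 0.5  # hypothetical regularisation strength
    # Hypothetical log-probabilities under the policy and the reference.
    h = log_ratio_margin(-1.0, -2.5, -1.5, -2.0)
    print("margin h:", h)
    print("DPO loss:", dpo_loss(h, tau))
    print("identity-Psi loss:", identity_psi_loss(h, tau))
```

Under these assumptions, the sketch shows one way to read the pitfall the abstract alludes to: the DPO-style loss rewards ever larger margins, whereas the identity-based loss is minimised at a finite margin controlled by `tau`.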