DPO and related algorithms align language models by directly optimizing the RLHF objective: find a policy that maximizes the Bradley-Terry reward while staying close to a reference policy through a KL-divergence penalty. Previous work showed that this approach can be generalized further: the original problem remains tractable even when the KL divergence is replaced by an $f$-divergence with a convex generating function $f$. Our first contribution is to show that convexity of $f$ is not essential. Instead, we identify a more general condition, referred to as DPO-inducing, that precisely characterizes when the RLHF problem remains tractable. Our next contribution is to establish a second condition on $f$ that is necessary to prevent probability displacement, a known empirical phenomenon in which the probabilities of both the winning and the losing responses approach zero. We refer to any $f$ satisfying this condition as displacement-resistant. Finally, we focus on a specific DPO-inducing and displacement-resistant $f$, leading to our novel SquaredPO loss. Compared to DPO, this new loss offers stronger theoretical guarantees while performing competitively in practice.
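To make the objective concrete, the following is a brief sketch of the KL-penalized RLHF problem and the $f$-divergence generalization referred to above, written in standard DPO notation ($\pi_\theta$, $\pi_{\mathrm{ref}}$, temperature $\beta$, preference pairs $(y_w, y_l)$); this notation is introduced here only for illustration and may differ from the paper's own.

$$
\max_{\pi}\; \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi(\cdot\mid x)}\!\left[r(x,y)\right] \;-\; \beta\, D_f\!\left(\pi(\cdot\mid x)\,\middle\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right),
\qquad
D_f(\pi\,\|\,\pi_{\mathrm{ref}}) = \mathbb{E}_{y\sim\pi_{\mathrm{ref}}}\!\left[f\!\left(\frac{\pi(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}\right)\right].
$$

For the generator $f(u)=u\log u$ this is the usual KL-penalized RLHF problem, whose closed-form solution yields the standard DPO loss on preference pairs $(x, y_w, y_l)$:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,y_w,y_l)}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right].
$$

Under the $f$-divergence generalization studied in prior work, the log-ratios $\log\frac{\pi_\theta}{\pi_{\mathrm{ref}}}$ inside the sigmoid are replaced by $f'\!\left(\frac{\pi_\theta}{\pi_{\mathrm{ref}}}\right)$ for a differentiable generator $f$; with $f(u)=u\log u$, $f'(u)=\log u + 1$ and the constant cancels in the difference, recovering DPO.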