Mult-DPO: Multinomial Direct Preference Optimization for Recommender Systems

Direct preference optimization (DPO) is a simple and effective strategy for aligning large language models (LLMs) with pairwise preferences. In recommender systems, however, user feedback is rarely pairwise. For a given context (e.g., a user, a session, or a conversation), we typically observe set-wise preferences with multiple positive items: every positive item should outrank every unobserved or explicitly negative item, with no prescribed order among the positives or among the negatives. A natural generalization is the Plackett-Luce (PL) reward model, which extends the Bradley-Terry reward model underlying vanilla DPO from pairwise preferences to full rankings of candidates. However, we show that adapting the PL model to set-wise preferences requires marginalizing over all orderings of the positive items, and the resulting expression has combinatorial complexity. To address this fundamental challenge, we propose Mult-DPO, a novel DPO objective with a tractable multinomial surrogate likelihood over set-wise preference events, designed for aligning LLM-based recommender systems with user preferences. The multinomial construction is not itself a ranking distribution, but it is defined on the same reward-induced weight space and admits a closed-form DPO-style objective, which enables direct alignment of LLMs with multiple candidates through a classification-style objective. In addition, we prove that the multinomial DPO loss is a tractable upper bound on the marginalized PL DPO loss when optimizing on set-wise preference data. We further characterize the tightness of this bound in terms of the total weight of the positives relative to that of the negatives, which provides insight into tightening the bound with richer or harder negatives. Finally, we extend Mult-DPO to the alignment of LLMs with multiple preference levels. Code is available at https://github.com/yaochenzhu/Mult_DPO
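
To make the combinatorial cost and the upper-bound claim concrete, the following is a minimal numerical sketch, not the authors' implementation. It assumes each candidate has a positive weight (for example, an exponentiated reward), computes the marginalized PL probability that all positives outrank all negatives by brute force over the k! orderings of the positives, and compares it with one plausible multinomial surrogate, k! * prod(w_p) / W^k. That surrogate form, and the function names `pl_setwise_prob`, `multinomial_surrogate_prob`, and `bound_gap`, are assumptions made for illustration; the abstract does not state the exact formula.

```python
import math
from itertools import permutations


def pl_setwise_prob(pos_w, neg_w):
    """Plackett-Luce probability that every positive outranks every negative.

    Marginalizes over all orderings of the positives by brute force,
    so the cost grows as k! in the number of positives k.
    """
    total = sum(pos_w) + sum(neg_w)
    prob = 0.0
    for order in permutations(pos_w):
        remaining = total  # total weight of items not yet ranked
        p = 1.0
        for w in order:
            p *= w / remaining
            remaining -= w
        prob += p
    return prob


def multinomial_surrogate_prob(pos_w, neg_w):
    """Hypothetical multinomial surrogate (an assumption, not from the source).

    Probability of drawing each positive exactly once in k independent
    draws with replacement, where draw probabilities are w_i / W.
    """
    total = sum(pos_w) + sum(neg_w)
    k = len(pos_w)
    return math.factorial(k) * math.prod(pos_w) / total ** k


def bound_gap(pos_w, neg_w):
    """Surrogate loss minus marginalized PL loss (negative log-likelihoods)."""
    pl_loss = -math.log(pl_setwise_prob(pos_w, neg_w))
    mult_loss = -math.log(multinomial_surrogate_prob(pos_w, neg_w))
    return mult_loss - pl_loss


if __name__ == "__main__":
    positives = [2.0, 1.0]
    print("gap with light negatives:", bound_gap(positives, [1.0, 1.0]))
    print("gap with heavy negatives:", bound_gap(positives, [50.0, 50.0]))
```

Under this assumed surrogate, the gap is never negative, because every PL denominator (the weight still unranked) is at most the full total W. The gap shrinks as the negatives carry more of the total weight, which is consistent with the abstract's remark about richer or harder negatives.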