Alignment of large language models (LLMs) has predominantly relied on pairwise preference optimization, where annotators select the better of two responses to a prompt. While simple, this approach overlooks the opportunity to learn from richer forms of human feedback, such as multiway comparisons and top-$k$ rankings. We introduce Ranked Choice Preference Optimization (RCPO), a unified framework that bridges preference optimization with (ranked) choice modeling via maximum likelihood estimation. RCPO supports both utility-based and rank-based choice models, subsumes several pairwise methods (such as DPO and SimPO) as special cases, and provides principled training objectives for richer feedback formats. We instantiate this framework with two representative models (Multinomial Logit and Mallows-RMJ). Experiments on Llama-3-8B-Instruct, Gemma-2-9B-it, and Mistral-7B-Instruct across in-distribution and out-of-distribution settings show that RCPO consistently outperforms competitive baselines. These results demonstrate that directly leveraging ranked preference data, combined with well-suited choice models, yields more effective alignment. RCPO thus offers an extensible foundation for incorporating (ranked) choice modeling into LLM training.
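As a concrete illustration of the utility-based side of such a framework (a standard form, not necessarily the paper's exact objective), the Multinomial Logit model over a ranking $y_{(1)} \succ \cdots \succ y_{(K)}$ of $K$ responses to a prompt $x$ corresponds to the Plackett-Luce likelihood under an assumed per-response utility $u_\theta(x, y)$, for example a DPO-style implicit reward $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$:

$$
P_\theta\bigl(y_{(1)} \succ \cdots \succ y_{(K)} \mid x\bigr)
= \prod_{k=1}^{K} \frac{\exp\!\bigl(u_\theta(x, y_{(k)})\bigr)}{\sum_{j=k}^{K} \exp\!\bigl(u_\theta(x, y_{(j)})\bigr)},
\qquad
\mathcal{L}_{\mathrm{MNL}}(\theta) = -\log P_\theta\bigl(y_{(1)} \succ \cdots \succ y_{(K)} \mid x\bigr).
$$

For $K = 2$ with the DPO-style utility above, this negative log-likelihood reduces to the familiar pairwise DPO loss, consistent with the claim that pairwise methods arise as special cases; the paper's exact parameterization and its rank-based Mallows-RMJ instantiation may differ from this sketch.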