Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model

Policy alignment to preference data typically assumes a known link function between observed preferences and latent rewards (e.g., Bradley-Terry model / logistic link). Misspecification of this link can bias inferred rewards and misalign learned policies. We study policy alignment under an unknown and unrestricted link function. We formulate an $f$-divergence-constrained reward maximization problem and show that realizability in a policy class induces a semiparametric single-index binary choice model, where a scalar policy-induced index captures all dependence on demonstrations and the remaining preference distribution is unrestricted. Rather than impose identifiability of structural parameters of such a model and estimate them, as in econometrics, we develop methods that directly learn policies, with the reward function implicit, analyzing error to the optimal policy and allowing for unidentifiable and nonparametric indices. We prove link-agnostic convergence guarantees in terms of generic function complexity measures and validate the methods and theory empirically. Code is available at https://github.com/causalml/spo/.
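As a minimal sketch of the single-index structure described above (not the paper's estimator), consider the KL-divergence special case. There the optimal policy's log-ratio to a reference policy recovers the reward up to a context-dependent constant, as in DPO-style derivations, so the preference probability depends on the responses only through one scalar. The function names, the reference-policy log-probabilities, and the two candidate links below are assumptions made for illustration. The index for a general $f$-divergence is not spelled out in the abstract and is not attempted here.

```python
import math


def kl_index(logp_a, logp_ref_a, logp_b, logp_ref_b):
    """Scalar index for the KL special case (assumed form).

    Difference of policy-to-reference log-ratios between response a and
    response b for the same context.
    """
    return (logp_a - logp_ref_a) - (logp_b - logp_ref_b)


def logistic_link(t):
    """Bradley-Terry / logistic link: the usual known-link assumption."""
    return 1.0 / (1.0 + math.exp(-t))


def probit_link(t):
    """A different hypothetical link (standard normal CDF)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def preference_prob(index, link):
    """P(a preferred to b) under a single-index model with the given link."""
    return link(index)


# Hypothetical log-probabilities for two responses under policy and reference.
t = kl_index(math.log(0.6), math.log(0.5), math.log(0.4), math.log(0.5))

# Same index, different links: the implied preference probabilities differ,
# which is why fitting with the wrong link can bias the inferred rewards.
p_logistic = preference_prob(t, logistic_link)
p_probit = preference_prob(t, probit_link)
print(t, p_logistic, p_probit)
```

Both links agree on which response is preferred but disagree on how strongly, so a method that fixes the logistic link would misread data generated by the other one. The link-agnostic setting in the abstract treats the link itself as unknown.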
