Aligning large language models (LLMs) with preference data typically assumes a known link function between observed preferences and latent rewards (e.g., a logistic Bradley-Terry link). Misspecifying this link can bias the inferred rewards and misalign the learned policies. We study preference alignment when the link function is unknown and unrestricted. We show that realizability of $f$-divergence-constrained reward maximization in a policy class induces a semiparametric single-index binary choice model, in which a scalar, policy-dependent index captures all dependence on the demonstrations while the remaining preference distribution is unrestricted. Rather than assuming this model has identifiable finite-dimensional structural parameters and estimating them, as in econometrics, we focus on policy learning that leaves the reward function implicit, analyzing the error relative to the optimal policy and allowing for unidentifiable nonparametric indices. We develop preference optimization algorithms that are robust to the unknown link and prove convergence guarantees in terms of generic function complexity measures. We demonstrate the approach empirically on LLM alignment. Code is available at https://github.com/causalml/spo/
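As a minimal illustration of the induced single-index structure, take the $f$-divergence to be the KL divergence with coefficient $\beta$ and reference policy $\pi_{\mathrm{ref}}$ (the notation $\beta$, $\pi_{\mathrm{ref}}$, and $\Lambda$ here is illustrative, not taken from the paper). KL-constrained reward maximization gives the familiar closed form $\pi^*(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\exp(r(x,y)/\beta)$, so the latent reward is recoverable, up to an action-independent shift, as $r(x,y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$, and a preference between responses $y_1, y_2$ then depends on the data only through a scalar, policy-dependent index passed through the unknown link $\Lambda$:
\[
\Pr(y_1 \succ y_2 \mid x)
\;=\;
\Lambda\!\left(
\beta \log \frac{\pi(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)}
\;-\;
\beta \log \frac{\pi(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)}
\right).
\]
Bradley-Terry-based methods such as DPO correspond to fixing $\Lambda$ to the logistic sigmoid; the setting above instead leaves $\Lambda$ unrestricted.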