Aligning a policy to preference data typically assumes a known link function between observed preferences and latent rewards (e.g., the logistic link of the Bradley-Terry model). Misspecifying this link can bias the inferred rewards and misalign the learned policy. We study policy alignment under an unknown and unrestricted link function. We formulate an $f$-divergence-constrained reward maximization problem and show that realizability in a policy class induces a semiparametric single-index binary choice model, in which a scalar policy-induced index captures all dependence on demonstrations and the rest of the preference distribution is unrestricted. Rather than imposing identifiability of the structural parameters of such a model and estimating them, as in econometrics, we develop methods that learn policies directly and leave the reward function implicit. We analyze the error relative to the optimal policy and allow for unidentifiable and nonparametric indices. We prove link-agnostic convergence guarantees in terms of generic function complexity measures and validate the methods and theory empirically. Code is available at https://github.com/causalml/spo/.
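
As a hedged illustration of how a scalar policy-induced index can arise, consider the special case in which the $f$-divergence is the KL divergence and the constraint is written in penalized (Lagrangian) form. The symbols $r$, $\pi_{\mathrm{ref}}$, $\beta$, $Z$, and $\Psi$ are notation assumed for this sketch and do not appear in the text above. That preferences depend on rewards only through their difference is an additional assumption of the sketch. The general $f$-divergence case may take a different form.

$$
\pi^\star = \arg\max_{\pi}\; \mathbb{E}_{x}\Big[\mathbb{E}_{y\sim\pi(\cdot\mid x)}\big[r(x,y)\big] - \beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big]
\;\;\Longrightarrow\;\;
r(x,y) = \beta \log\frac{\pi^\star(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} + \beta \log Z(x),
$$

where $Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y\mid x)\exp\big(r(x,y)/\beta\big)$. Substituting into a preference model with an unknown link $\Psi$ applied to the reward difference, the term $\beta \log Z(x)$ cancels:

$$
P(y_1 \succ y_2 \mid x) = \Psi\!\left(\beta \log\frac{\pi^\star(y_1\mid x)}{\pi_{\mathrm{ref}}(y_1\mid x)} - \beta \log\frac{\pi^\star(y_2\mid x)}{\pi_{\mathrm{ref}}(y_2\mid x)}\right).
$$

Under these assumptions the argument of $\Psi$ is a scalar index determined by the policy, and the Bradley-Terry model corresponds to the particular choice $\Psi(t) = 1/(1+e^{-t})$. Leaving $\Psi$ unspecified gives a single-index binary choice model of the kind described above.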