Reinforcement learning from human feedback (RLHF) implicitly aggregates heterogeneous human preferences into a single utility function, even though the underlying utilities of the participants are in practice diverse. RLHF can therefore be viewed as a form of voting, where the aggregation mechanism is defined by the loss function. Although Arrow's Impossibility Theorem implies that different mechanisms satisfy different sets of desirable axioms, most existing methods rely on a single aggregation principle, typically the Bradley-Terry-Luce (BTL) model, which corresponds to Borda count voting. This restricts the axiomatic properties of the learned reward and obscures the normative assumptions embedded in the optimization. In this work, we introduce Differential Voting, a unifying framework that constructs instance-wise, differentiable loss functions whose population-level optima provably correspond to distinct classical voting rules. We develop differentiable surrogates for majority-based aggregation (BTL), Copeland, and Kemeny rules, and formally analyze their calibration properties, gradient fields, and limiting behavior as smoothing parameters vanish. For each loss, we establish consistency with the corresponding social choice rule and characterize the axioms it satisfies or violates. Our analysis shows how design choices in loss geometry, such as margin sensitivity and boundary concentration, directly translate into normative aggregation behavior. Differential Voting makes preference aggregation an explicit and controllable design choice in RLHF, enabling principled trade-offs between axiomatic guarantees and optimization stability. Code to reproduce our experiments is open-sourced.
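To make the idea of differentiable voting surrogates concrete, here is a minimal sketch. The standard BTL reward-model loss is the logistic negative log-likelihood of a pairwise comparison; a smoothed Copeland score can be obtained by replacing the hard "a beats b" indicator with a sigmoid of the reward margin at temperature tau. The function names and this particular smoothing are illustrative assumptions, not the paper's exact construction:

```python
import math

def btl_pairwise_loss(r_winner: float, r_loser: float) -> float:
    """BTL negative log-likelihood for one comparison: -log sigma(r_winner - r_loser).

    This is the standard pairwise reward-model loss used in RLHF.
    """
    return math.log(1.0 + math.exp(-(r_winner - r_loser)))

def smooth_copeland_score(r_a: float, others: list[float], tau: float = 0.1) -> float:
    """Smoothed Copeland score of alternative a against a field of rivals.

    Each hard pairwise-win indicator 1[r_a > r_b] is relaxed to a sigmoid
    in the margin; as tau -> 0 the score recovers the exact count of
    pairwise wins, matching the vanishing-smoothing limit discussed above.
    """
    return sum(1.0 / (1.0 + math.exp(-(r_a - r_b) / tau)) for r_b in others)
```

For example, `smooth_copeland_score(1.0, [0.0, 2.0], tau=0.01)` is close to 1.0, since the alternative with reward 1.0 wins exactly one of its two pairwise contests; larger values of tau trade this exactness for smoother gradients.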