The Representation-Rationalizability Tradeoff in Reward Learning

In RLHF, each training example contains a prompt $x$ and two candidate responses $y,y'$, and annotators provide a pairwise preference between the two responses. The learning problem is to convert these heterogeneous pairwise judgments into a single scalar reward $r(x,y)$ that measures response quality for each prompt. Classical social choice theory implies an impossibility here: heterogeneous annotators can induce pooled preferences that contain Condorcet cycles, so no scalar reward can rank all compared response pairs consistently. A growing literature analyzes RLHF as a social-choice problem, but it usually assumes a fixed finite set of alternatives, i.e., a pre-enumerated finite set of candidate responses for each prompt. Modern pipelines instead score responses through a learned representation $φ(x,y)$ followed by a scalar head, so $φ$ determines which responses are treated as distinguishable alternatives and which comparisons are visible to the reward model. Once this embedding is part of the problem, the impossibility results of social choice theory become a tradeoff. We show that the excess cross-entropy loss of any reward built on $φ$ decomposes exactly into a representational term, which a richer $φ$ shrinks, and an aggregation term, which a richer $φ$ enlarges by exposing more comparisons that no scalar can rank consistently. The same results extend to direct preference optimization (DPO), and jointly training the embedding and the reward is not guaranteed to recover the sweet spot of this tradeoff. Experiments on synthetic data and real preference datasets corroborate our results.
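To make the Condorcet-cycle obstruction concrete, the following minimal Python sketch builds a toy example and fits a scalar reward to it. Everything in it is an illustrative assumption rather than the paper's construction: the three annotator rankings are hypothetical, the reward is tabular (one scalar per response, which corresponds to a $φ$ that distinguishes all three responses), and the Bradley-Terry link with plain gradient descent is a stand-in for whatever fitting procedure is actually used. The gap it reports is only an analogue of the aggregation term, not the exact decomposition.

```python
import itertools
import math

# Hypothetical toy data: three annotators each rank three responses to one prompt.
# Together the rankings form a Condorcet cycle: A beats B, B beats C, C beats A.
rankings = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]
responses = ["A", "B", "C"]


def pooled_preference(a, b):
    """Fraction of annotators who rank response a above response b."""
    wins = sum(r.index(a) < r.index(b) for r in rankings)
    return wins / len(rankings)


pairs = list(itertools.combinations(responses, 2))
pooled = {(a, b): pooled_preference(a, b) for a, b in pairs}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(reward):
    """Mean cross-entropy of a Bradley-Terry scalar reward against the pooled preferences."""
    total = 0.0
    for (a, b), p in pooled.items():
        q = sigmoid(reward[a] - reward[b])
        total -= p * math.log(q) + (1 - p) * math.log(1 - q)
    return total / len(pooled)


def entropy_floor():
    """Loss of a predictor that matches every pooled pairwise probability exactly."""
    total = 0.0
    for p in pooled.values():
        total -= p * math.log(p) + (1 - p) * math.log(1 - p)
    return total / len(pooled)


def fit_reward(steps=2000, lr=0.5):
    """Fit one scalar per response by gradient descent on the cross-entropy."""
    reward = {"A": 0.3, "B": -0.2, "C": 0.1}  # arbitrary starting point
    for _ in range(steps):
        grad = {y: 0.0 for y in responses}
        for (a, b), p in pooled.items():
            q = sigmoid(reward[a] - reward[b])
            grad[a] += q - p  # d(loss)/d r_a for this pair
            grad[b] -= q - p  # d(loss)/d r_b for this pair
        for y in responses:
            reward[y] -= lr * grad[y] / len(pooled)
    return reward


best = fit_reward()
print("pooled preferences:", pooled)
print("best scalar-reward loss:", cross_entropy(best))
print("entropy floor:", entropy_floor())
print("gap no scalar reward can close:", cross_entropy(best) - entropy_floor())
```

In this toy case every pairwise majority is 2/3, yet the fitted rewards collapse to a common value, so the model predicts 1/2 on every pair and its loss sits at $\log 2 \approx 0.693$, above the floor of roughly $0.637$. The remaining gap comes purely from the cycle: the pooled preferences are perfectly well defined, but no assignment of scalars can reproduce them.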
