A common approach for aligning language models to human preferences is to first learn a reward model from preference data, and then use this reward model to update the language model. We study two closely related problems that arise in this approach. First, any monotone transformation of the reward model preserves preference ranking; is there a choice that is ``better'' than others? Second, we often wish to align language models to multiple properties: how should we combine multiple reward models? Using a probabilistic interpretation of the alignment procedure, we identify a natural choice of transformation for (the common case of) rewards learned from Bradley-Terry preference models. This derived transformation has two important properties. First, it emphasizes improving poorly performing outputs, rather than outputs that already score well. This mitigates both underfitting (where some prompts are not improved) and reward hacking (where the model learns to exploit misspecification of the reward model). Second, it enables principled aggregation of rewards by linking summation to logical conjunction: the sum of transformed rewards corresponds to the probability that the output is ``good'' in all measured properties, in a sense we make precise. Experiments aligning language models to be both helpful and harmless using RLHF show substantial improvements over the baseline (non-transformed) approach.
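As a brief illustration of the aggregation claim (a sketch whose notation, $u_i$, $\mathrm{good}_i$, and $k$, is introduced here for exposition rather than taken from the paper's derivation): suppose each transformed reward $u_i(x, y)$ can be read as the log-probability that output $y$ is ``good'' with respect to property $i$ for prompt $x$. If the $k$ properties are judged independently, then
\[
\sum_{i=1}^{k} u_i(x, y) \;=\; \sum_{i=1}^{k} \log \Pr(\mathrm{good}_i \mid x, y) \;=\; \log \prod_{i=1}^{k} \Pr(\mathrm{good}_i \mid x, y) \;=\; \log \Pr\bigl(\text{good on all } k \text{ properties} \mid x, y\bigr),
\]
so summing transformed rewards corresponds to the log-probability that the output is good on every measured property; the body of the paper states the precise sense in which this correspondence holds.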