Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF

In practice, preference learning from human feedback depends on incomplete data with hidden context. Hidden context refers to data that affects the feedback received but is not represented in the data used to train a preference model. This captures common issues of data collection, such as human annotators with varied preferences, cognitive processes that result in seemingly irrational behavior, and the combination of data labeled according to different criteria. We prove that standard applications of preference learning, including reinforcement learning from human feedback (RLHF), implicitly aggregate over hidden contexts according to a well-known voting rule called Borda count. We show that this can produce counter-intuitive results that are very different from those of other methods that implicitly aggregate via expected utility. Furthermore, our analysis formalizes the way that preference learning from users with diverse values tacitly implements a social choice function. A key implication of this result is that annotators have an incentive to misreport their preferences in order to influence the learned model, leading to vulnerabilities in the deployment of RLHF. As a step towards mitigating these problems, we introduce a class of methods called distributional preference learning (DPL). DPL methods estimate a distribution of possible score values for each alternative in order to better account for hidden context. Experimental results indicate that applying DPL to RLHF for LLM chatbots identifies hidden context in the data and significantly reduces subsequent jailbreak vulnerability. Our code and data are available at https://github.com/cassidylaidlaw/hidden-context.
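To make the contrast between Borda-count aggregation and expected-utility aggregation concrete, the following is a minimal toy sketch rather than the authors' implementation. It assumes noiseless pairwise comparisons and a hidden context with two annotator groups. All names and numbers (`CONTEXT_PROBS`, `UTILITIES`, the alternatives `A`, `B`, `C`) are hypothetical and chosen only for illustration. The Borda score here is the unnormalized sum of pairwise win probabilities against the other alternatives, which gives the same ranking as a normalized version.

```python
# Toy illustration only: the groups, alternatives, and utilities are hypothetical,
# not taken from the paper or its code release.

# Hidden context z: which annotator group produced a comparison (unseen by the learner).
CONTEXT_PROBS = {"majority": 0.6, "minority": 0.4}

# Utility of each alternative under each hidden context.
UTILITIES = {
    "A": {"majority": 1.0, "minority": -10.0},  # mildly liked by most, very bad for some
    "B": {"majority": 0.5, "minority": 0.5},    # acceptable to everyone
    "C": {"majority": 0.8, "minority": 0.0},
}


def win_probability(a, b):
    """P(a is preferred to b), marginalized over the hidden context.

    Assumes noiseless comparisons: within a context, the higher-utility
    alternative always wins, and ties are split evenly.
    """
    total = 0.0
    for z, prob in CONTEXT_PROBS.items():
        ua, ub = UTILITIES[a][z], UTILITIES[b][z]
        if ua > ub:
            total += prob
        elif ua == ub:
            total += 0.5 * prob
    return total


def borda_scores():
    """Unnormalized Borda score: summed win probability against every other alternative."""
    return {
        a: sum(win_probability(a, b) for b in UTILITIES if b != a)
        for a in UTILITIES
    }


def expected_utilities():
    """Expected utility of each alternative, averaging over the hidden context."""
    return {
        a: sum(prob * UTILITIES[a][z] for z, prob in CONTEXT_PROBS.items())
        for a in UTILITIES
    }


borda = borda_scores()
eu = expected_utilities()
borda_winner = max(borda, key=borda.get)
eu_winner = max(eu, key=eu.get)

print("Borda scores:      ", borda, "-> winner:", borda_winner)
print("Expected utilities:", eu, "-> winner:", eu_winner)
```

In this toy setup the two rules disagree: the Borda-style aggregate favors `A` because it wins most pairwise comparisons, while expected utility favors `B` because `A` is very costly for the minority group. Comparison data alone records only which alternative won, not by how much, which is the kind of information loss the abstract attributes to hidden context.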
