Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF

In practice, preference learning from human feedback depends on incomplete data with hidden context. Hidden context refers to data that affects the feedback received but is not represented in the data used to train a preference model. This captures common issues of data collection, such as human annotators with varied preferences, cognitive processes that result in seemingly irrational behavior, and the combination of data labeled according to different criteria. We prove that standard applications of preference learning, including reinforcement learning from human feedback (RLHF), implicitly aggregate over hidden contexts according to a well-known voting rule called Borda count. We show that this can produce counter-intuitive results that differ markedly from those of methods that implicitly aggregate via expected utility. Furthermore, our analysis formalizes the way that preference learning from users with diverse values tacitly implements a social choice function. A key implication of this result is that annotators have an incentive to misreport their preferences in order to influence the learned model, leading to vulnerabilities in the deployment of RLHF. As a step towards mitigating these problems, we introduce a class of methods called distributional preference learning (DPL). DPL methods estimate a distribution of possible score values for each alternative in order to better account for hidden context. Experimental results indicate that applying DPL to RLHF for LLM chatbots identifies hidden context in the data and significantly reduces subsequent jailbreak vulnerability. Our code and data are available at https://github.com/cassidylaidlaw/hidden-context.
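To make the contrast between Borda count and expected utility concrete, the following is a minimal sketch on invented toy data, not code from the paper or its repository. The context names (`z1`, `z2`), their probabilities, the utility values, and the helper functions are all hypothetical. It assumes noiseless comparisons (an annotator in a given context always prefers the higher-utility alternative) and uses an unnormalized Borda score, which may differ from the paper's definition by a positive scaling that does not change the ranking. It computes the two aggregates directly rather than training a preference model.

```python
# Hypothetical toy data: two hidden contexts (e.g. two annotator groups),
# their probabilities, and the utility each assigns to three alternatives.
context_probs = {"z1": 0.6, "z2": 0.4}
utilities = {
    "z1": {"A": 1.0, "B": 0.9, "C": -1.0},
    "z2": {"A": 0.0, "B": 1.0, "C": -1.0},
}
alternatives = ["A", "B", "C"]


def expected_utility(alt):
    """Average the utility of `alt` over the hidden contexts."""
    return sum(p * utilities[z][alt] for z, p in context_probs.items())


def borda_score(alt):
    """Sum, over every rival, of the probability that `alt` wins the comparison.

    Assumes noiseless comparisons: within a context, the alternative with
    the strictly higher utility is always preferred.
    """
    score = 0.0
    for other in alternatives:
        if other == alt:
            continue
        score += sum(
            p
            for z, p in context_probs.items()
            if utilities[z][alt] > utilities[z][other]
        )
    return score


eu_ranking = sorted(alternatives, key=expected_utility, reverse=True)
borda_ranking = sorted(alternatives, key=borda_score, reverse=True)

print("Expected-utility ranking:", eu_ranking)
print("Borda-count ranking:     ", borda_ranking)
```

In this toy setup the majority context `z1` only slightly prefers A to B, while the minority context `z2` strongly prefers B to A. Borda count sees only how often each alternative wins a comparison, so it ranks A first; expected utility weighs the size of the preference and ranks B first.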
