Reinforcement learning from human feedback usually models preferences using a reward function that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, such as the training of large language models. We formalise and analyse the problem of learning a reward model that can be specialised to a user. Using the principle of empirical risk minimisation, we derive a probably approximately correct (PAC) bound showing how the approximation error depends, as usual, on the number of training examples, but also on the number of human raters who provided feedback on them. Based on our theoretical findings, we discuss how to best collect pairwise preference data and argue that adaptive reward models should be beneficial when there is considerable disagreement among users. We also propose a concrete architecture for an adaptive reward model. Our approach leverages the observation that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual, even if their preferences are not reflected in the training data. We present experiments with large language models illustrating our theoretical results and comparing the proposed architecture with a non-adaptive baseline. Consistent with our analysis, the benefits provided by our model increase with the number of raters and the heterogeneity of their preferences. We also show that our model compares favourably to adaptive counterparts, including those performing in-context personalisation.
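To make the adaptation step concrete, here is a minimal sketch, assuming the general reward features have already been learned by a shared feature network: per-user weights are fitted to a handful of pairwise preferences under a Bradley-Terry model, so that the personalised reward is a linear combination of the shared features. All names (fit_user_weights, phi_pref, phi_rej) and hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Sketch: adapt a reward model to one user by fitting linear weights over
# fixed, pre-learned reward features phi(x) (illustrative, not the paper's code).
import numpy as np

def fit_user_weights(features_pref, features_rej, n_steps=500, lr=0.1, l2=1e-2):
    """Fit per-user weights w so that r_u(x) = w @ phi(x) explains the user's
    pairwise preferences under a Bradley-Terry model:
        P(a preferred to b) = sigmoid(w @ (phi(a) - phi(b))).
    features_pref / features_rej: (n_pairs, d) arrays of reward features for
    the preferred and rejected response in each labelled pair."""
    d = features_pref.shape[1]
    w = np.zeros(d)
    diff = features_pref - features_rej            # (n_pairs, d) feature gaps
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-diff @ w))        # predicted P(pref > rej)
        # Gradient of the logistic (Bradley-Terry) loss plus L2 regularisation.
        grad = diff.T @ (p - 1.0) / len(diff) + l2 * w
        w -= lr * grad
    return w

# Usage: adapt to a new user from a few labelled pairs (synthetic data here),
# then score candidate responses with the personalised reward w_u @ phi(x).
rng = np.random.default_rng(0)
phi_pref, phi_rej = rng.normal(size=(20, 8)), rng.normal(size=(20, 8))
w_u = fit_user_weights(phi_pref, phi_rej)
```

Because only the low-dimensional weight vector is re-fitted while the features stay fixed, a few preference pairs from a new user suffice, which is what allows adaptation even to users absent from the training data.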