Reinforcement learning from human feedback usually models preferences using a reward model that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, such as the training of large language models. We propose a method to specialise a reward model to a person or group of people. Our approach builds on the observation that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual, even if their preferences are not reflected in the training data. We present experiments with large language models comparing the proposed architecture with a non-adaptive reward model and with adaptive counterparts, including models that perform in-context personalisation. Depending on how much disagreement there is in the training data, our model either significantly outperforms the baselines or matches their performance with a simpler architecture and more stable training.
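The adaptation idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual method: it assumes a shared feature map phi(prompt, response) has already been learned, and fits a user-specific weight vector w so that the user's reward is r_u(x, y) = w . phi(x, y). The Bradley-Terry-style logistic fit on a handful of preference pairs, and all names below, are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 8            # number of general reward features (assumed already learned)
n_pairs = 200    # preference pairs observed for one user

# Stand-in feature vectors for the chosen / rejected responses in each pair.
phi_chosen = rng.normal(size=(n_pairs, k))
phi_rejected = rng.normal(size=(n_pairs, k))

# Ground-truth user weights (unknown in practice) generate synthetic labels.
w_true = rng.normal(size=k)
d_phi = phi_chosen - phi_rejected
y = (d_phi @ w_true > 0).astype(float)  # 1 if the user prefers the first item

def fit_user_weights(d_phi, y, lr=0.1, steps=500):
    """Fit w by gradient ascent on the Bradley-Terry log-likelihood:
    P(chosen > rejected) = sigmoid(w . (phi_chosen - phi_rejected))."""
    w = np.zeros(d_phi.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(d_phi @ w)))
        w += lr * d_phi.T @ (y - p) / len(y)
    return w

w_hat = fit_user_weights(d_phi, y)

# The recovered direction should align with the true user weights.
cos = w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true))
print(cos > 0.5)
```

Because only the low-dimensional vector w is fit per user while the features stay fixed, adaptation to a new individual needs far less data than retraining a full reward model.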