Reinforcement learning from human feedback (RLHF) has been effective for AI alignment. However, a key assumption of RLHF is that the annotators (referred to as workers hereafter) share a homogeneous response space. This assumption does not hold in most practical settings, and prior studies have challenged it. Inspired by such studies, this work explores one way to handle heterogeneity in worker preferences: clustering workers with similar preferences and personalising a reward model for each cluster. We provide an algorithm that learns reward models and worker embeddings simultaneously, and evaluate it empirically on the Reddit TL;DR dataset augmented with unique worker IDs. We show that clustering workers into groups based on their preferences and creating personalised reward models improves the win rate of those models. Along with results and visualisations, this work aims to serve as a stepping stone towards more sophisticated models and lists possible future extensions.
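As a rough illustration of the kind of joint learning the abstract describes, the minimal sketch below conditions a reward model on a learned per-worker embedding and trains both with a pairwise preference loss; every module name, dimensionality, and the choice of a Bradley-Terry loss are assumptions for exposition, not details taken from the paper.

```python
import torch
import torch.nn as nn

class PersonalisedRewardModel(nn.Module):
    """Reward model conditioned on a learned per-worker embedding.
    All names and dimensions are illustrative, not the paper's."""
    def __init__(self, num_workers: int, text_dim: int = 768, emb_dim: int = 16):
        super().__init__()
        self.worker_emb = nn.Embedding(num_workers, emb_dim)  # learned jointly with the reward head
        self.head = nn.Sequential(
            nn.Linear(text_dim + emb_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, text_feats: torch.Tensor, worker_ids: torch.Tensor) -> torch.Tensor:
        z = self.worker_emb(worker_ids)                       # (B, emb_dim)
        return self.head(torch.cat([text_feats, z], dim=-1)).squeeze(-1)

def preference_loss(model, feats_chosen, feats_rejected, worker_ids):
    """Bradley-Terry pairwise loss: the worker's chosen summary should score higher."""
    r_chosen = model(feats_chosen, worker_ids)
    r_rejected = model(feats_rejected, worker_ids)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# After training, the worker embeddings could be clustered (e.g. with k-means)
# and a personalised reward model fitted per cluster -- again, a sketch:
# from sklearn.cluster import KMeans
# groups = KMeans(n_clusters=4).fit_predict(model.worker_emb.weight.detach().cpu().numpy())
```

Backpropagating the pairwise loss updates the embedding table and the reward head together, which is one plausible reading of "simultaneous learning of reward models and worker embeddings"; the actual architecture and clustering procedure are specified in the body of the paper.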