Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences. Typically, RLHF aggregates preferences from multiple individuals who have diverse viewpoints that may conflict with each other. Our work \textit{initiates} the theoretical study of multi-party RLHF that explicitly models the diverse preferences of multiple individuals. We show how traditional RLHF approaches can fail since learning a single reward function cannot capture and balance the preferences of multiple individuals. To overcome such limitations, we incorporate meta-learning to learn multiple preferences and adopt different social welfare functions to aggregate the preferences across multiple parties. We focus on the offline learning setting and establish sample complexity bounds, along with efficiency and fairness guarantees, for optimizing diverse social welfare functions such as Nash, Utilitarian, and Leximin welfare functions. Our results show a separation between the sample complexities of multi-party RLHF and traditional single-party RLHF. Furthermore, we consider a reward-free setting, where each individual's preference is no longer consistent with a reward model, and give pessimistic variants of the von Neumann Winner based on offline preference data. Taken together, our work showcases the advantage of multi-party RLHF but also highlights its more demanding statistical complexity.
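To make the aggregation step concrete, the following is a minimal sketch of the three social welfare functions named in the abstract (Utilitarian, Nash, and Leximin), using their standard textbook definitions. The function names, the `options` table, and the utility values are hypothetical and chosen only for illustration; the paper's own definitions may differ, for example in normalization or in subtracting a baseline utility before taking the Nash product.

```python
import math

# Each option maps to one utility per party (hypothetical numbers).
options = {
    "A": [9.0, 1.0, 1.0],  # great for one party, poor for the others
    "B": [3.0, 3.0, 3.0],  # perfectly equal
    "C": [4.0, 4.0, 2.0],  # in between
}

def utilitarian(utilities):
    """Utilitarian welfare: the sum of the parties' utilities."""
    return sum(utilities)

def nash(utilities):
    """Nash welfare: the product of the parties' utilities
    (assumes non-negative utilities and a zero baseline)."""
    return math.prod(utilities)

def leximin_key(utilities):
    """Leximin: compare the worst-off party first, then the second
    worst-off, and so on. Sorting ascending and comparing the tuples
    lexicographically implements this ordering."""
    return tuple(sorted(utilities))

def best_option(options, welfare):
    """Return the name of the option that maximizes the given welfare."""
    return max(options, key=lambda name: welfare(options[name]))

print(best_option(options, utilitarian))  # A (sums: 11, 9, 10)
print(best_option(options, nash))         # C (products: 9, 27, 32)
print(best_option(options, leximin_key))  # B (worst-off: 1, 3, 2)
```

On this toy input the three welfare functions each select a different option, which illustrates why the choice of aggregation rule matters when the parties' preferences conflict.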