Reinforcement Learning from Human Feedback (RLHF) aligns language models with human preferences by using a single reward model derived from preference data. However, this approach overlooks the diversity of preferences present in data collected from multiple users. In this work, we first derive an impossibility result for alignment with single-reward RLHF, showing that it cannot adequately represent diverse human preferences. To provide an equitable solution, we learn a mixture of preference distributions via an expectation-maximization algorithm and propose a MaxMin alignment objective for policy learning, inspired by the Egalitarian principle in social choice theory. We establish the connection between our approach and both distributionally robust optimization and general utility RL, which highlights its generality and robustness. We present comprehensive experimental results on small-scale (GPT-2) and large-scale (Tulu2-7B) language models and show the efficacy of the proposed approach when human preferences are diverse. Our algorithm achieves an average improvement of more than 16% in win rate over conventional RLHF algorithms and improves the win rate (accuracy) for minority groups by over 33% without compromising the performance of majority groups, demonstrating the robustness and fairness of our approach. We remark that our findings are not limited to language models but extend to reinforcement learning in general.
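To make the contrast between a single averaged reward and an egalitarian MaxMin criterion concrete, the following is a minimal toy sketch, not the algorithm used in this work. It assumes the expected reward of each candidate policy under each user group is already known; the policy names, group names, utilities, and weights are all hypothetical values chosen for illustration.

```python
# Toy illustration only: every name and number below is hypothetical.
# Each entry is the expected reward of a candidate policy under the
# reward model of one user group.
group_utilities = {
    "policy_a": {"majority": 0.9, "minority": 0.2},
    "policy_b": {"majority": 0.7, "minority": 0.6},
}

# Assumed share of each group in the preference data.
group_weights = {"majority": 0.8, "minority": 0.2}


def average_utility(utilities, weights):
    """Population-weighted utility, standing in for a single pooled reward."""
    return sum(weights[group] * value for group, value in utilities.items())


def worst_case_utility(utilities):
    """Utility of the worst-off group (the egalitarian criterion)."""
    return min(utilities.values())


def select_by_average(candidates, weights):
    """Pick the candidate with the highest weighted-average utility."""
    return max(candidates, key=lambda name: average_utility(candidates[name], weights))


def select_by_maxmin(candidates):
    """Pick the candidate that maximizes the minimum group utility."""
    return max(candidates, key=lambda name: worst_case_utility(candidates[name]))


if __name__ == "__main__":
    print("average criterion picks:", select_by_average(group_utilities, group_weights))
    print("MaxMin criterion picks: ", select_by_maxmin(group_utilities))
```

In this toy setting the weighted average favors the policy that serves the majority group well, while the MaxMin criterion prefers the policy whose worst-off group fares best. In practice the group reward models would themselves have to be learned, for example with the expectation-maximization step described above.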