Ensuring AI models align with human values is essential for their safety and functionality. Reinforcement learning from human feedback (RLHF) leverages human preferences to achieve this alignment. However, when preferences are sourced from diverse populations, point estimates of reward can result in suboptimal performance or be unfair to specific groups. We propose Pareto Optimal Preference Learning (POPL), which enables pluralistic alignment by framing discrepant group preferences as objectives with potential trade-offs, aiming for policies that are Pareto-optimal on the preference dataset. POPL utilizes lexicase selection, an iterative process that selects diverse, Pareto-optimal solutions. Our theoretical and empirical evaluations demonstrate that POPL surpasses baseline methods in learning sets of reward functions and policies, effectively catering to distinct groups without access to the number of groups or group membership labels. We verify the performance of POPL in a stateless preference learning setting, a Minigrid RL domain, Metaworld robotics benchmarks, and large language model (LLM) fine-tuning. We illustrate that POPL can also serve as a foundation for techniques optimizing specific notions of group fairness, ensuring safe and equitable AI model alignment.
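The abstract names lexicase selection as the mechanism for selecting diverse, Pareto-optimal hypotheses. Below is a minimal sketch of standard lexicase selection applied to a pass/fail matrix over preference comparisons; it is an illustration of the general technique, not the authors' exact procedure, and the binary scoring scheme, function name, and toy data are assumptions made for the example.

```python
import random
import numpy as np

def lexicase_select(scores: np.ndarray) -> int:
    """Select one candidate index via lexicase selection.

    scores[i, j] = 1 if candidate reward hypothesis i satisfies
    preference comparison j (ranks the preferred segment higher), else 0.
    Cases are visited in random order; at each case, only candidates with
    the best score on that case survive, so survivors are non-dominated
    on the cases considered.
    """
    candidates = list(range(scores.shape[0]))
    cases = list(range(scores.shape[1]))
    random.shuffle(cases)  # random case ordering yields diverse selections
    for case in cases:
        best = max(scores[c, case] for c in candidates)
        candidates = [c for c in candidates if scores[c, case] == best]
        if len(candidates) == 1:
            break
    return random.choice(candidates)

# Toy usage: repeatedly select hypotheses (with replacement) to form the next population.
rng = np.random.default_rng(0)
toy_scores = (rng.random((50, 200)) > 0.5).astype(float)  # 50 hypotheses, 200 comparisons
selected = [lexicase_select(toy_scores) for _ in range(50)]
```

Because each call shuffles the preference comparisons independently, different calls can favor hypotheses that satisfy different subsets of groups' preferences, which is how the selection pressure spreads across the Pareto front rather than collapsing to a single point estimate.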