We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback. Our approach is minimalist in that it requires neither training a reward model nor unstable adversarial training and is therefore simple to implement. Our approach is maximalist in that it provably handles non-Markovian, intransitive, and stochastic preferences while being robust to the compounding errors that plague offline approaches to sequential prediction. To achieve the preceding qualities, we build upon the concept of a Minimax Winner (MW), a notion of preference aggregation from the social choice theory literature that frames learning from preferences as a zero-sum game between two policies. By leveraging the symmetry of this game, we prove that rather than using the traditional technique of dueling two policies to compute the MW, we can simply have a single agent play against itself while maintaining strong convergence guarantees. Practically, this corresponds to sampling multiple trajectories from a policy, asking a preference or teacher model to compare them, and then using the proportion of wins as the reward for a particular trajectory. We demonstrate that, on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model-based approaches while maintaining robustness to the intransitive and stochastic preferences that frequently occur in practice when aggregating human judgments.
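For concreteness, the zero-sum game behind the Minimax Winner can be sketched as follows; the notation $\mathcal{P}(\xi \succ \xi')$ for the probability that trajectory $\xi$ is preferred to trajectory $\xi'$ is introduced here only for illustration and is not fixed by the text above:

\[
  \pi^{\star} \;\in\; \arg\max_{\pi}\,\min_{\pi'}\;
  \mathbb{E}_{\xi \sim \pi,\ \xi' \sim \pi'}\!\left[\mathcal{P}(\xi \succ \xi')\right].
\]

Intuitively, a Minimax Winner is a (possibly randomized) policy whose trajectories are preferred at least half of the time against any opposing policy, a notion that remains well defined even when the underlying preferences are intransitive.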
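As a rough sketch of the practical recipe described above (not the paper's implementation; the preference oracle `prefer`, the batch size, and the tie handling are all assumptions made here for illustration), the self-play reward computation might look like:

```python
import itertools

def spo_rewards(trajectories, prefer):
    """Win-rate rewards for a batch of trajectories sampled from one policy.

    `prefer(a, b)` is an assumed preference/teacher oracle returning 1.0 if
    trajectory `a` is preferred to `b`, 0.0 if not, and 0.5 for a tie.
    Each trajectory's reward is the fraction of pairwise comparisons it wins
    against the other samples in the same batch.
    """
    n = len(trajectories)
    rewards = [0.0] * n
    # Compare every ordered pair of distinct trajectories in the batch.
    for i, j in itertools.permutations(range(n), 2):
        rewards[i] += prefer(trajectories[i], trajectories[j]) / (n - 1)
    return rewards


# Toy usage: "trajectories" are scalars and the teacher prefers larger values.
if __name__ == "__main__":
    batch = [0.2, 1.3, 0.7, 0.9]
    teacher = lambda a, b: 1.0 if a > b else (0.5 if a == b else 0.0)
    print(spo_rewards(batch, teacher))  # [0.0, 1.0, 0.333..., 0.666...]
```

Because every trajectory is scored against its peers from the same policy, no separate reward model or opponent policy needs to be trained; the win rate itself serves as the scalar reward passed to a standard policy-optimization algorithm.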