We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback. Our approach is minimalist in that it requires neither training a reward model nor unstable adversarial training, and is therefore rather simple to implement. Our approach is maximalist in that it provably handles non-Markovian, intransitive, and stochastic preferences while being robust to the compounding errors that plague offline approaches to sequential prediction. To achieve these properties, we build upon the concept of a Minimax Winner (MW), a notion of preference aggregation from the social choice theory literature that frames learning from preferences as a zero-sum game between two policies. By leveraging the symmetry of this game, we prove that rather than using the traditional technique of dueling two policies to compute the MW, we can simply have a single agent play against itself while maintaining strong convergence guarantees. Practically, this corresponds to sampling multiple trajectories from a policy, asking a rater or preference model to compare them, and then using the proportion of wins as the reward for a particular trajectory. We demonstrate on a suite of continuous control tasks that we are able to learn significantly more efficiently than reward-model-based approaches while maintaining robustness to the intransitive and stochastic preferences that frequently occur in practice when aggregating human judgments.
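The practical recipe described above (sample several trajectories from one policy, have a rater compare them, and use each trajectory's proportion of wins as its reward) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `win_rate_rewards`, the rater callback `prefers`, and the rock-paper-scissors toy rater are hypothetical names introduced here, and the sketch assumes a deterministic pairwise rater and omits the policy update that would consume these rewards.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")  # a trajectory; any type the rater can compare


def win_rate_rewards(
    trajectories: Sequence[T],
    prefers: Callable[[T, T], bool],
) -> List[float]:
    """Return, for each trajectory, the fraction of the other sampled
    trajectories that the rater prefers it over.

    `prefers(a, b)` is a hypothetical rater: True if `a` is preferred to `b`.
    """
    n = len(trajectories)
    if n < 2:
        raise ValueError("need at least two trajectories to compare")
    rewards = []
    for i, a in enumerate(trajectories):
        # Count wins against every other trajectory in the same batch.
        wins = sum(
            1 for j, b in enumerate(trajectories) if j != i and prefers(a, b)
        )
        rewards.append(wins / (n - 1))
    return rewards


# Toy intransitive rater (assumption for illustration): rock-paper-scissors.
BEATS = {("rock", "scissors"), ("paper", "rock"), ("scissors", "paper")}


def rps_prefers(a: str, b: str) -> bool:
    return (a, b) in BEATS


rps_rewards = win_rate_rewards(["rock", "paper", "scissors"], rps_prefers)
ordered_rewards = win_rate_rewards([3, 1, 2], lambda a, b: a > b)
```

With the cyclic rater, every trajectory wins exactly one of its two comparisons, so all three receive the same reward and no single ordering has to be imposed on the intransitive preferences. With the transitive rater (`a > b`), the rewards simply recover the ranking.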