Many alignment methods, including reinforcement learning from human feedback (RLHF), rely on the Bradley-Terry reward assumption, which is insufficient to capture the full range of general human preferences. To achieve robust alignment with general preferences, we model the alignment problem as a two-player zero-sum game, where the Nash equilibrium policy guarantees a 50% win rate against any competing policy. However, previous algorithms for finding the Nash policy either diverge or converge to a Nash policy in a modified game, even in a simple synthetic setting, thereby failing to maintain the 50% win rate guarantee against all other policies. We propose a meta-algorithm, Convergent Meta Alignment Algorithm (COMAL), for language model alignment with general preferences, inspired by convergent algorithms in game theory. Theoretically, we prove that our meta-algorithm converges to an exact Nash policy in the last iterate. Additionally, our meta-algorithm is simple and can be integrated with many existing methods designed for RLHF and preference optimization with minimal changes. Experimental results demonstrate the effectiveness of the proposed framework when combined with existing preference policy optimization methods.
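To make the game-theoretic claim concrete, the following is a minimal sketch of the standard two-player zero-sum formulation used in the general-preference alignment literature; the notation \(\mathbb{P}(\pi \succ \pi')\), denoting the probability that a response sampled from \(\pi\) is preferred to one sampled from \(\pi'\) (averaged over prompts) under the general preference model, is introduced here for illustration and is not taken verbatim from the paper:
\[
\pi^{*} \;\in\; \arg\max_{\pi}\ \min_{\pi'}\ \mathbb{P}\left(\pi \succ \pi'\right),
\qquad
\mathbb{P}\left(\pi^{*} \succ \pi'\right) \;\ge\; \tfrac{1}{2}\quad \text{for every competing policy } \pi'.
\]
Because the preference game is symmetric, i.e. \(\mathbb{P}(\pi \succ \pi') + \mathbb{P}(\pi' \succ \pi) = 1\), its value is \(1/2\), which is exactly the 50% win rate guarantee of the Nash equilibrium policy referenced above.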