Many alignment methods, including reinforcement learning from human feedback (RLHF), rely on the Bradley-Terry reward assumption, which is not always sufficient to capture the full range and complexity of general human preferences. We explore RLHF under a general preference framework by modeling the alignment problem as a two-player zero-sum game, whose Nash equilibrium policy guarantees a 50% win rate against any competing policy. However, previous self-play algorithms for finding the Nash policy either diverge or converge only to the Nash policy of a modified game, even in a simple synthetic setting, and thus fail to maintain the 50% win-rate guarantee against all other policies. We propose a meta-algorithm, the Convergent Meta Alignment Algorithm (COMAL), for language model alignment with general preferences, inspired by convergent algorithms in game theory. We provide a theoretical analysis showing that our meta-algorithm converges to an exact Nash policy in the last iterate, and we demonstrate its effectiveness on a range of synthetic and preference optimization datasets. COMAL is simple and can be integrated with many existing preference optimization methods with minimal changes; empirically, it consistently maintains win rates above 60.2% and 56.8% against all compared algorithms under controlled evaluations when applied to Llama-3-8B-Instruct and Qwen2.5-7B, respectively.
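To make the game-theoretic framing concrete, here is a minimal sketch, not the paper's COMAL algorithm or its experimental setup: it builds a small hypothetical preference matrix with intransitive (non-Bradley-Terry) preferences and finds the Nash policy of the induced two-player zero-sum game with optimistic multiplicative-weights self-play, one classic algorithm known to converge in the last iterate for zero-sum games, illustrating the 50% worst-case win-rate guarantee described above.

```python
# Minimal synthetic illustration (assumed example, not the paper's COMAL):
# find the Nash policy of the preference game and check its win-rate guarantee.
import numpy as np

# Hypothetical intransitive preferences over 3 responses (rock-paper-scissors-like),
# which no Bradley-Terry reward model can represent exactly.
# P[i, j] = probability that response i is preferred over response j.
P = np.array([
    [0.5, 0.8, 0.3],
    [0.2, 0.5, 0.9],
    [0.7, 0.1, 0.5],
])
A = P - 0.5          # antisymmetric payoff: expected win-rate margin

eta = 0.1            # step size
pi = np.ones(3) / 3  # current policy (mixed strategy over responses)
grad_prev = A @ pi   # previous-iterate gradient for the optimistic step

for _ in range(5000):
    grad = A @ pi
    # Optimistic MWU: extrapolate with 2*grad - grad_prev before the
    # multiplicative update, which stabilizes the last iterate
    # (plain MWU self-play can cycle instead of converging).
    logits = np.log(pi) + eta * (2 * grad - grad_prev)
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    grad_prev = grad

# The Nash policy wins at least 50% against every opponent policy; it suffices
# to check all pure (deterministic) opponents, since any mixed opponent's win
# rate is an average over pure ones.
win_rates = P.T @ pi  # win rate of pi against each pure opponent
print("Nash policy:", np.round(pi, 3))
print("Worst-case win rate:", round(win_rates.min(), 3))  # ~0.5 at the Nash policy
```

In this synthetic setting the exact Nash policy is (4/9, 2/9, 3/9), and the last iterate of the sketch approaches it, so the worst-case win rate approaches 0.5; a policy that converges only to the Nash policy of a modified game would fall short of this guarantee against some opponents.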