Current RLHF frameworks for aligning large language models (LLMs) typically assume a fixed prompt distribution, which is suboptimal and limits both the scalability of alignment and the generalizability of models. To address this, we introduce a general open-ended RLHF framework that casts alignment as an asymmetric game between two players: (i) a creator, which uses reward signals to generate increasingly informative prompt distributions, and (ii) a solver, which learns to produce more preferred responses to the prompts the creator generates. This framework of Evolving Alignment via Asymmetric Self-Play (eva) yields a simple and efficient approach that can utilize any existing RLHF algorithm for scalable alignment. eva outperforms state-of-the-art methods on widely used benchmarks without the need for any additional human-crafted prompts. Specifically, eva improves the win rate of Gemma-2-9B-it on Arena-Hard from 51.6% to 60.1% with DPO, from 55.7% to 58.9% with SPPO, from 52.3% to 60.7% with SimPO, and from 54.8% to 60.3% with ORPO, surpassing its 27B counterpart and matching claude-3-opus. This improvement persists even when new human-crafted prompts are introduced. Finally, we show that eva is effective and robust across various ablation settings.
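For intuition, the sketch below illustrates how a creator-solver loop of this kind could be structured. All helpers (`reward_model`, `generate_responses`, `evolve_prompt`, `dpo_update`) are hypothetical placeholders rather than the paper's actual API, and the reward-gap heuristic for prompt informativeness is one plausible choice for the creator's signal, not necessarily the authors' method.

```python
import random

# Minimal sketch of an eva-style creator-solver loop, assuming hypothetical
# helpers throughout; this illustrates the loop structure from the abstract,
# not the paper's implementation.

def reward_model(prompt: str, response: str) -> float:
    """Stand-in scalar reward; a real setup would query a trained RM."""
    return random.random()

def generate_responses(solver, prompt: str, n: int = 4) -> list[str]:
    """Stand-in sampler; a real setup would decode from the solver LLM."""
    return [f"{prompt} -> response {i}" for i in range(n)]

def evolve_prompt(prompt: str) -> str:
    """Stand-in prompt mutation; a real creator would rewrite the prompt."""
    return prompt + " (evolved)"

def dpo_update(solver, prompt: str, chosen: str, rejected: str):
    """Stand-in optimizer step; any preference-optimization algorithm
    (DPO, SPPO, SimPO, ORPO, ...) could slot in here."""
    return solver

def informativeness(prompt: str, solver) -> float:
    """Reward gap between the best and worst sampled responses; prompts
    with a large gap are treated as more informative for the solver
    (an assumed heuristic, not confirmed by the abstract)."""
    scores = [reward_model(prompt, r) for r in generate_responses(solver, prompt)]
    return max(scores) - min(scores)

def eva_step(prompts: list[str], solver, k: int = 2):
    # Creator: keep and mutate the prompts with the highest reward gap.
    ranked = sorted(prompts, key=lambda p: informativeness(p, solver), reverse=True)
    new_prompts = ranked[:k] + [evolve_prompt(p) for p in ranked[:k]]
    # Solver: run preference optimization on the evolved prompt set.
    for p in new_prompts:
        responses = generate_responses(solver, p)
        scored = sorted(responses, key=lambda r: reward_model(p, r))
        solver = dpo_update(solver, p, chosen=scored[-1], rejected=scored[0])
    return new_prompts, solver

# Usage: alternate creator and solver updates from a small seed set.
prompts, solver = ["Explain RLHF.", "Summarize asymmetric self-play."], object()
for _ in range(3):
    prompts, solver = eva_step(prompts, solver)
```

Because the two players update asymmetrically, the prompt curriculum and the policy improve together, which is what allows any off-the-shelf RLHF algorithm to serve as the solver's update rule.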