Current alignment methods for Large Language Models (LLMs) rely on compressing vast amounts of human preference data into static, absolute reward functions, leading to data scarcity, noise sensitivity, and training instability. We introduce Elo-Evolve, a co-evolutionary framework that redefines alignment as dynamic multi-agent competition within an adaptive opponent pool. Our approach introduces two key innovations: (1) eliminating the dependency on the Bradley-Terry model by learning directly from binary win/loss outcomes in pairwise competitions, and (2) implementing Elo-orchestrated opponent selection that provides automatic curriculum learning through temperature-controlled sampling. We ground our approach in PAC learning theory, showing that pairwise comparison achieves a superior sample-complexity bound, and empirically validate a 4.5x noise reduction compared to absolute scoring approaches. Experimentally, we train a Qwen2.5-7B model using our framework with an opponent pool that includes Qwen2.5-14B, Qwen2.5-32B, and Qwen3-8B. Results demonstrate a clear performance hierarchy, point-based methods < static pairwise training < Elo-Evolve, on both AlpacaEval 2.0 and MT-Bench, validating the progressive benefits of pairwise comparison and dynamic opponent selection for LLM alignment.
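To make the Elo mechanics concrete, the following is a minimal sketch of the two ingredients the abstract names: rating updates from binary win/loss outcomes and temperature-controlled opponent sampling. The K-factor, the initial ratings, the proximity-based softmax rule, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
import math
import random

# Sketch of Elo-orchestrated opponent selection with temperature-controlled
# sampling. The K-factor, ratings, and sampling rule are assumptions for
# illustration only.

K = 32  # standard Elo K-factor (assumed)

def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected win probability of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_a: float, r_b: float, a_won: bool) -> tuple[float, float]:
    """Update both ratings from a single binary win/loss outcome."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + K * (s_a - e_a), r_b + K * ((1.0 - s_a) - (1.0 - e_a))

def sample_opponent(learner_rating: float,
                    pool: dict[str, float],
                    temperature: float = 100.0) -> str:
    """Softmax over negative rating gaps: opponents near the learner's current
    Elo are sampled most often. Lower temperature concentrates on near-peers;
    higher temperature flattens toward uniform sampling over the pool."""
    names = list(pool)
    logits = [-abs(pool[n] - learner_rating) / temperature for n in names]
    z = max(logits)  # subtract max for numerical stability
    weights = [math.exp(l - z) for l in logits]
    return random.choices(names, weights=weights, k=1)[0]

# Hypothetical opponent pool mirroring the abstract's setup.
pool = {"Qwen2.5-14B": 1080.0, "Qwen2.5-32B": 1150.0, "Qwen3-8B": 1040.0}
learner = 1000.0
opponent = sample_opponent(learner, pool)
# After judging one pairwise competition, update both ratings from the outcome:
learner, pool[opponent] = update_elo(learner, pool[opponent], a_won=True)
```

Under these assumptions, sampling near-peer opponents keeps each outcome maximally informative (win probabilities near 50%), which is one plausible reading of how Elo-orchestrated selection yields an automatic curriculum: as the learner's rating climbs, the sampler shifts probability mass toward progressively stronger opponents.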