Reinforcement learning from human feedback (RLHF) has become the leading approach for fine-tuning large language models (LLMs). However, RLHF has limitations in multi-task learning (MTL) due to the challenges of reward hacking and extreme multi-objective optimization (i.e., the trade-off among multiple, sometimes conflicting objectives). Applying RLHF to MTL currently requires careful tuning of the weights for reward models and data combinations. This is often done via human intuition and does not generalize. In this work, we introduce a novel post-training paradigm which we call Constrained Generative Policy Optimization (CGPO). The core of CGPO is a Mixture of Judges (MoJ), combined with cost-efficient constrained policy optimization with stratification, which can identify the ideal blend in RLHF in a principled manner. CGPO shows strong empirical results with theoretical guarantees, does not require extensive hyperparameter tuning, and is plug-and-play in common post-training pipelines. Together, these components detect and mitigate reward-hacking behaviors while reaching a Pareto-optimal point across an extremely large number of objectives. Our empirical evaluations demonstrate that CGPO significantly outperforms standard RLHF algorithms such as PPO and DPO across various tasks, including general chat, STEM questions, instruction following, and coding. Specifically, CGPO achieves improvements of 7.4% on AlpacaEval-2 (general chat) and 12.5% on Arena-Hard (STEM & reasoning), with consistent gains in other domains such as math and coding. Notably, PPO, while commonly used, is prone to severe reward hacking on popular coding benchmarks, which CGPO successfully mitigates. This breakthrough in RLHF not only tackles the reward-hacking and extreme multi-objective optimization challenges but also advances the state of the art in aligning general-purpose LLMs for diverse applications.
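To make the mixture-of-judges idea concrete, here is a minimal, hypothetical sketch: several judges each check one constraint, and only generations that pass every judge are kept for the policy update, so constraint-violating (e.g., reward-hacked) samples never contribute gradient signal. The judge names and filtering rule below are illustrative assumptions, not the paper's exact algorithm.

```python
# Illustrative sketch (not the paper's implementation): rule-based "judges"
# screen sampled generations before a constrained policy-optimization step.

def length_judge(response: str) -> bool:
    """Reject degenerate, overly short responses."""
    return len(response.split()) >= 3

def repetition_judge(response: str) -> bool:
    """Reject reward hacking by repetition (any word making up > 50% of tokens)."""
    words = response.lower().split()
    return all(words.count(w) / len(words) <= 0.5 for w in set(words))

JUDGES = [length_judge, repetition_judge]  # hypothetical mixture of judges

def filter_generations(samples):
    """Keep only (response, reward) pairs that satisfy every judge."""
    return [(resp, rew) for resp, rew in samples if all(j(resp) for j in JUDGES)]

samples = [
    ("yes yes yes yes", 0.9),           # high reward but repetitive -> dropped
    ("The answer is 42 because ...", 0.7),
    ("ok", 0.8),                        # too short -> dropped
]
kept = filter_generations(samples)      # only the second sample survives
```

In the full method, judges may be LLM-based or rule-based per task, and the surviving samples feed a constrained update rather than a plain filtered policy gradient; this sketch only shows the screening step.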