The performance of the reward model (RM) is a critical factor in improving the effectiveness of the large language model (LLM) during alignment fine-tuning. Two challenges remain in RM training: 1) training the same RM on various categories of data may cause its generalization performance to suffer from multi-task disturbance, and 2) the human annotation consistency rate is generally only $60\%$ to $75\%$, so the training data contain a lot of noise. To tackle these two challenges, we introduce the idea of Mixture-of-Experts (MoE) into the field of RM for the first time and propose the Double-Layer MoE RM (DMoERM). The outer layer is a sparse MoE: after classifying an input into a task category, we route it to the corresponding inner-layer task-specific model. The inner layer is a dense MoE: we decompose the specific task into multiple capability dimensions, individually fine-tune a LoRA expert on each one, and synthesize their outputs with an MLP to compute the final reward. To minimize costs, we call a public LLM API to obtain the capability preference labels. Validation on manually labeled datasets confirms that our model attains superior consistency with human preferences and outstrips advanced generative approaches. Meanwhile, through Best-of-n (BoN) sampling and RL experiments, we demonstrate that our model outperforms state-of-the-art RM ensemble methods and mitigates the overoptimization problem. Our code and dataset are available at: https://github.com/quanshr/DMoERM-v1.
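To make the double-layer architecture concrete, below is a minimal PyTorch sketch of the routing scheme described above. It is an illustration under simplifying assumptions, not the paper's implementation: plain `nn.Linear` heads stand in for the LoRA-fine-tuned experts (which in the paper are adapters on a shared LM backbone), the outer router is a simple linear classifier, and all names (`InnerMoERM`, `DMoERM`, `hidden_dim`, `n_tasks`, `n_capabilities`) are hypothetical.

```python
import torch
import torch.nn as nn


class InnerMoERM(nn.Module):
    """Dense inner-layer MoE for one task category: every expert scores the
    input along one capability dimension, and an MLP aggregates the
    per-dimension scores into a single scalar reward."""

    def __init__(self, hidden_dim: int, n_capabilities: int):
        super().__init__()
        # Linear heads stand in for the paper's LoRA-fine-tuned experts.
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(n_capabilities)]
        )
        self.aggregator = nn.Sequential(
            nn.Linear(n_capabilities, n_capabilities),
            nn.ReLU(),
            nn.Linear(n_capabilities, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) pooled representation of a (prompt, response) pair.
        per_dim = torch.cat([expert(h) for expert in self.experts], dim=-1)
        return self.aggregator(per_dim).squeeze(-1)  # (batch,)


class DMoERM(nn.Module):
    """Sparse outer layer: a task classifier hard-routes each input to
    exactly one task-specific inner model."""

    def __init__(self, hidden_dim: int, n_tasks: int, n_capabilities: int):
        super().__init__()
        self.router = nn.Linear(hidden_dim, n_tasks)  # task-category classifier
        self.inner = nn.ModuleList(
            [InnerMoERM(hidden_dim, n_capabilities) for _ in range(n_tasks)]
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        task = self.router(h).argmax(dim=-1)  # top-1 routing: one inner model per input
        rewards = torch.empty(h.size(0), device=h.device)
        for t in task.unique():
            mask = task == t
            rewards[mask] = self.inner[int(t)](h[mask])
        return rewards


if __name__ == "__main__":
    model = DMoERM(hidden_dim=1024, n_tasks=5, n_capabilities=4)
    h = torch.randn(8, 1024)
    print(model(h).shape)  # torch.Size([8])
```

The design choice mirrored here is that outer routing is hard and sparse (each input activates only one task-specific model), while within a task the MoE is dense: every capability expert contributes a score, and the MLP learns how to weight the per-dimension scores into the final reward.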