Reward models (RMs) are a critical component of reinforcement learning from human feedback (RLHF). However, conventional dense RMs are susceptible to exploitation by policy models through biases or spurious correlations, resulting in reward hacking: RM scores increase during training while alignment with human preferences deteriorates, a problem that is further exacerbated under distribution shift. To address this issue, we propose UMM-RM (Upcycle-and-Merge MoE Reward Model). UMM-RM first upscales the feed-forward layers of a dense backbone into a mixture-of-experts (MoE) reward model with shared experts. The shared experts are always activated to capture instruction-agnostic preference signals, while the remaining experts model fine-grained preferences across instructions or task regimes. After training, the experts are consolidated into a single dense RM via learnable merging weights. This design retains the robustness and exploitation resistance provided by expert diversity while avoiding the inference overhead of MoE architectures or explicit ensembles. Experiments across multiple base models and preference datasets show that, compared with standard dense RMs, UMM-RM improves accuracy on preference data, reduces reward hacking during PPO training, and yields more stable preference alignment.
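The expert-consolidation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the merge is a softmax-normalized convex combination of per-expert feed-forward weight matrices, with the merging logits treated as learnable parameters (in UMM-RM these would be trained; here they are simply initialized). All names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration (not taken from the paper).
d_model, d_ff, n_experts = 8, 16, 4

# Per-expert FFN up-projection weights after MoE training.
expert_weights = rng.standard_normal((n_experts, d_ff, d_model))

# Learnable merging logits; initialized uniformly here. In training these
# would be optimized so the merged dense RM preserves preference accuracy.
merge_logits = np.zeros(n_experts)

def merge_experts(weights, logits):
    """Collapse expert weight tensors into one dense weight matrix via a
    softmax-normalized weighted sum over the expert axis."""
    alphas = np.exp(logits) / np.exp(logits).sum()   # softmax over experts
    return np.einsum("e,eij->ij", alphas, weights)   # convex combination

dense_weight = merge_experts(expert_weights, merge_logits)
assert dense_weight.shape == (d_ff, d_model)

# Sanity check: with uniform logits the merge is a plain mean of experts.
assert np.allclose(dense_weight, expert_weights.mean(axis=0))
```

Applying the same merge to every expert matrix in each MoE layer yields a single dense model with standard dense inference cost, which is the property the abstract claims.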