The emergence of LM-based judging reward modeling, represented by generative reward models, has made reinforcement learning from AI feedback (RLAIF) efficient and scalable. To advance this paradigm further, we offer a core insight: this form of reward modeling is formally consistent with natural language inference (NLI), a core task in natural language understanding. This reframing points to a key path toward stronger reward models: expanding the model's comprehension boundaries. Following this path, exploratory experiments on NLI tasks show that masked language models (MLMs) performing slot prediction with contextual explanations significantly outperform mainstream autoregressive models. Building on this finding, we propose ESFP-RM, a two-stage LM-based judging reward model that uses an explanation-based slot framework for prediction to fully exploit the strengths of MLMs. Extensive experiments demonstrate that, in both reinforcement learning from human feedback (RLHF) and out-of-distribution (OOD) scenarios, ESFP-RM delivers more stable and generalizable reward signals than generative reward models.
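To make the core idea concrete, the sketch below illustrates how an explanation-conditioned verdict can be cast as MLM slot prediction: the prompt, response, and an explanation are concatenated, and a masked verdict slot is filled by the MLM, whose probability mass on a positive verdict token serves as the reward signal. This is a minimal illustration rather than the ESFP-RM implementation; the checkpoint name, prompt template, and verdict words ("good"/"bad") are assumptions chosen for readability.

```python
# Minimal sketch (not the authors' released code) of NLI-style reward judging
# as masked-slot prediction, assuming a RoBERTa-style MLM checkpoint and
# single-token verdict words; template and model name are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "roberta-large"  # assumption: any MLM checkpoint could be used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

def slot_reward(prompt: str, response: str, explanation: str) -> float:
    """Score a response by filling a masked verdict slot, given an explanation."""
    text = (
        f"Question: {prompt} Answer: {response} "
        f"Explanation: {explanation} "
        f"Overall, the answer is {tokenizer.mask_token}."
    )
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # Locate the masked verdict slot in the tokenized sequence.
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    # Compare two candidate verdict tokens; the positive share is the reward.
    good_id = tokenizer.encode(" good", add_special_tokens=False)[0]
    bad_id = tokenizer.encode(" bad", add_special_tokens=False)[0]
    probs = torch.softmax(logits[[good_id, bad_id]], dim=-1)
    return probs[0].item()
```

In a two-stage setup, the explanation passed to `slot_reward` would come from a separate generation stage; here it is simply an input argument.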