Automatic evaluation is crucial yet challenging for open-ended natural language generation, especially when rule-based metrics are infeasible. Compared with traditional methods, the recent LLM-as-a-Judge paradigm enables more accurate and flexible evaluation and shows promise as a generative reward model for reinforcement learning (RL). However, prior work has revealed a notable gap between the seemingly impressive benchmark performance of such judges and their actual effectiveness in RL practice. We attribute this gap to limitations of existing studies, including the dominance of pairwise evaluation and inadequate optimization of evaluation criteria. We therefore propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method and guided by unified query-based evaluation criteria. Using only about 5.7K high-quality examples curated from open-source preference data, CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice.
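To make the pointwise-versus-pairwise distinction concrete, the following is a minimal sketch (not the paper's implementation) of how a pointwise reward model plugs into Best-of-N selection; `score_response` is a hypothetical stand-in for a generative reward model that judges a single (query, response) pair against query-based criteria.

```python
# Minimal sketch: pointwise reward modeling in a Best-of-N setting.
# `score_response` is a hypothetical placeholder for a pointwise
# generative RM that scores one (query, response) pair independently.
from typing import Callable, List


def best_of_n(
    query: str,
    candidates: List[str],
    score_response: Callable[[str, str], float],
) -> str:
    """Pointwise Best-of-N: score each candidate independently (N calls
    to the reward model), then return the highest-scoring one. A pairwise
    judge would instead need comparisons or a tournament to rank them."""
    scores = [score_response(query, c) for c in candidates]
    return candidates[scores.index(max(scores))]


# Usage with a toy scorer (real use would call a generative reward model):
if __name__ == "__main__":
    toy_scorer = lambda q, r: float(len(set(q.split()) & set(r.split())))
    print(best_of_n(
        "what is RL",
        ["RL is reinforcement learning", "no idea"],
        toy_scorer,
    ))
```

Because each score is produced independently, pointwise rewards drop directly into RL pipelines that need a scalar per sampled response, whereas pairwise judges only yield relative preferences.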