Pre-Trained Policy Discriminators are General Reward Models

Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, Songyang Gao, Chengqi Lv, Enyu Zhou, Honglin Guo, Zhiheng Xi, Wenwei Zhang, Qipeng Guo, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Tao Gui, Kai Chen

We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods that rely on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, relative to state-of-the-art (SOTA) baselines, POLAR-7B improves preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance: it improves LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% across 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.
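As a reading aid for the scaling claim: a power law y = a * x^b becomes a straight line after taking logarithms of both axes, which is presumably why a linear correlation coefficient is used as supporting evidence. The following is a minimal sketch of such a check, not the authors' evaluation code. The function names, the choice of "loss" as the performance metric, and all numbers are hypothetical and synthetic; none of them are taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_power_law(compute, metric):
    """Fit metric = a * compute**b by least squares in log-log space.

    Returns (a, b, r), where r is the Pearson coefficient between
    log(compute) and log(metric).
    """
    log_x = [math.log(c) for c in compute]
    log_y = [math.log(m) for m in metric]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y)) / sum(
        (x - mean_x) ** 2 for x in log_x
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope, pearson(log_x, log_y)


# Synthetic, noise-free data (assumption: NOT results from the paper).
compute = [1e18, 1e19, 1e20, 1e21]
loss = [2.0 * c ** -0.05 for c in compute]

a, b, r = fit_power_law(compute, loss)
print(f"a={a:.3f}, b={b:.3f}, |r|={abs(r):.4f}")
```

When the metric falls as compute grows, the fitted exponent and the correlation coefficient are both negative, so it is the magnitude of r that is compared against 1. Real measurements carry noise, which pulls that magnitude below 1.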
