Reward modeling is crucial for aligning large language models (LLMs) with human preferences, especially in reinforcement learning from human feedback (RLHF). However, current reward models mainly produce scalar scores and struggle to incorporate critiques expressed in natural language. We hypothesize that predicting both a critique and the scalar reward improves reward modeling. Motivated by this, we propose Critic-RM, a framework that improves reward models using self-generated critiques without extra supervision. Critic-RM employs a two-stage process: it first generates and filters high-quality critiques, then jointly fine-tunes the model on reward prediction and critique generation. Experiments across benchmarks show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% over standard reward models and LLM judges, demonstrating strong performance and data efficiency. Additional studies further validate that the generated critiques help rectify flawed reasoning steps, yielding 2.5%-3.2% gains in reasoning accuracy.
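For concreteness, one plausible instantiation of the joint fine-tuning objective (a sketch under our assumptions, not the paper's stated loss) combines a standard Bradley-Terry pairwise loss on reward prediction with a language-modeling loss on the filtered self-generated critique, traded off by a weight \lambda:

\[
\mathcal{L}(\theta) \;=\; \underbrace{-\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)}_{\text{reward prediction}} \;+\; \lambda \underbrace{\Big(-\sum_{t} \log p_\theta\big(c_t \mid x, y, c_{<t}\big)\Big)}_{\text{critique generation}}
\]

where, for a prompt x, y_w and y_l denote the preferred and rejected responses, c is the filtered critique generated for a response y, r_\theta is the scalar reward head, and p_\theta is the model's token distribution; the specific pairing of losses and the value of \lambda are assumptions for illustration.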