Reward modeling is crucial for aligning large language models (LLMs) with human preferences, especially in reinforcement learning from human feedback (RLHF). However, current reward models mainly produce scalar scores and struggle to incorporate critiques expressed in natural language. We hypothesize that predicting both a critique and the scalar reward improves reward modeling. Motivated by this, we propose Critic-RM, a framework that improves reward models using self-generated critiques without extra supervision. Critic-RM employs a two-stage process: it first generates and filters high-quality critiques, then jointly fine-tunes the model on reward prediction and critique generation. Experiments across benchmarks show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% over standard reward models and LLM judges, demonstrating strong performance and data efficiency. Additional studies further validate that the generated critiques help rectify flawed reasoning steps, yielding 2.5%-3.2% gains in reasoning accuracy.
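For concreteness, one plausible instantiation of the joint fine-tuning objective (a sketch under our assumptions, not the paper's stated loss) combines a standard Bradley-Terry pairwise loss on reward prediction with a language-modeling loss on the filtered self-generated critique, traded off by a weight \lambda:

\[
\mathcal{L}(\theta) \;=\; \underbrace{-\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)}_{\text{reward prediction}} \;+\; \lambda \underbrace{\Big(-\sum_{t} \log p_\theta\big(c_t \mid x, y, c_{<t}\big)\Big)}_{\text{critique generation}}
\]

where, for a prompt x, y_w and y_l denote the preferred and rejected responses, c is the filtered critique generated for a response y, r_\theta is the scalar reward head, and p_\theta is the model's token distribution; the specific pairing of losses and the value of \lambda are assumptions for illustration.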