OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment

Reward modeling lies at the core of reinforcement learning from human feedback (RLHF), yet most existing reward models rely on scalar or pairwise judgments that fail to capture the multifaceted nature of human preferences. Recent studies have explored rubrics-as-rewards (RaR), which uses structured criteria to capture multiple dimensions of response quality. However, producing rubrics that are both reliable and scalable remains a key challenge. In this work, we introduce OpenRubrics, a diverse, large-scale collection of (prompt, rubric) pairs for training rubric-generation and rubric-based reward models. To elicit discriminative and comprehensive evaluation signals, we introduce Contrastive Rubric Generation (CRG), which derives both hard rules (explicit constraints) and principles (implicit qualities) by contrasting preferred and rejected responses. We further remove noisy rubrics by enforcing preference-label consistency. Across multiple reward-modeling benchmarks, our rubric-based reward model, Rubric-RM, surpasses strong size-matched baselines by 8.4%. These gains transfer to policy models on instruction-following and biomedical benchmarks.
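
The preference-label consistency filter mentioned above can be illustrated with a minimal sketch. It shows one plausible reading of the idea: a rubric is kept only if a judge that sees the rubric recovers the original preference label. The names `RubricExample`, `Judge`, `is_label_consistent`, `filter_rubrics`, and `toy_judge` are hypothetical and do not come from the paper; the swapped-order check is an extra precaution assumed here, and the keyword-matching judge is only a stand-in for an LLM-based judge.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricExample:
    """One preference pair plus a candidate rubric (hypothetical schema)."""
    prompt: str
    chosen: str          # response labeled as preferred
    rejected: str        # response labeled as rejected
    rubric: List[str]    # rubric items, e.g. hard rules and principles


# A judge maps (prompt, rubric, response_a, response_b) to "A" or "B".
# In practice this would be an LLM call; here it is just a callable.
Judge = Callable[[str, List[str], str, str], str]


def is_label_consistent(ex: RubricExample, judge: Judge) -> bool:
    """Return True if the rubric-conditioned judge recovers the original label.

    Assumption: both presentation orders are checked to reduce position
    bias. This is not stated in the source.
    """
    forward = judge(ex.prompt, ex.rubric, ex.chosen, ex.rejected) == "A"
    backward = judge(ex.prompt, ex.rubric, ex.rejected, ex.chosen) == "B"
    return forward and backward


def filter_rubrics(examples: List[RubricExample], judge: Judge) -> List[RubricExample]:
    """Keep only examples whose rubric agrees with the preference label."""
    return [ex for ex in examples if is_label_consistent(ex, judge)]


def toy_judge(prompt: str, rubric: List[str], response_a: str, response_b: str) -> str:
    """Toy stand-in for an LLM judge: counts rubric keywords in each response."""
    def score(response: str) -> int:
        return sum(1 for item in rubric if item.lower() in response.lower())

    # Ties go to "B", so a rubric that cannot separate the pair fails
    # the forward check and is filtered out.
    return "A" if score(response_a) > score(response_b) else "B"
```

Calling `filter_rubrics(examples, toy_judge)` on a list of `RubricExample` objects returns the subset whose rubrics separate the chosen response from the rejected one in the labeled direction. Swapping `toy_judge` for a model-backed judge leaves the filtering logic unchanged.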
