Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, and a unified mechanism for integrating all of these types of evidence remains unexplored. To this end, we propose the Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface for orchestrating heterogeneous resources, dynamically selecting and aggregating the evidence suited to the specific requirements of each input. This lets the reward model move beyond static evaluation while keeping its judgments consistent and transparent across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, show that Skill-RM consistently outperforms traditional judge baselines. These findings suggest that Skill-RM not only offers a unified solution for reward modeling but also achieves superior performance through the strategic, dynamic orchestration of evidence. The code is available at https://github.com/Qwen-Applications/Skill-RM.
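To make the idea of selecting and aggregating heterogeneous evidence per input more concrete, the following is a minimal illustrative sketch, not the Skill-RM implementation. Every name in it (`Sample`, `Evidence`, `reference_check`, `checklist_check`, `nonempty_rule`, `evaluate`, `best_of_n`) is hypothetical, the weights are arbitrary, and simple string checks stand in for the verifiers, references, and checklists mentioned in the abstract. It assumes only that each evidence source either applies to an input or abstains, and that the applicable scores are combined by a weighted average.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Sample:
    prompt: str
    response: str
    reference: Optional[str] = None   # ground-truth answer, if one exists
    checklist: Tuple[str, ...] = ()   # required phrases, if any (hypothetical)


@dataclass
class Evidence:
    source: str
    score: float    # normalized to [0, 1]
    weight: float   # arbitrary illustrative weight


def reference_check(s: Sample) -> Optional[Evidence]:
    # Applies only when a ground-truth reference is available.
    if s.reference is None:
        return None
    match = s.response.strip() == s.reference.strip()
    return Evidence("reference", float(match), 2.0)


def checklist_check(s: Sample) -> Optional[Evidence]:
    # Applies only when a checklist is available.
    if not s.checklist:
        return None
    hits = sum(item.lower() in s.response.lower() for item in s.checklist)
    return Evidence("checklist", hits / len(s.checklist), 1.0)


def nonempty_rule(s: Sample) -> Optional[Evidence]:
    # A trivial rule-based verifier that always applies.
    return Evidence("rule:nonempty", float(bool(s.response.strip())), 0.5)


EVALUATORS: List[Callable[[Sample], Optional[Evidence]]] = [
    reference_check,
    checklist_check,
    nonempty_rule,
]


def evaluate(s: Sample) -> Tuple[float, List[Evidence]]:
    # Select the evidence sources that apply to this input, then aggregate.
    evidence = [e for f in EVALUATORS if (e := f(s)) is not None]
    total = sum(e.weight for e in evidence)
    reward = sum(e.score * e.weight for e in evidence) / total
    return reward, evidence


def best_of_n(prompt: str, candidates: List[str], **resources) -> str:
    # Best-of-N selection: keep the candidate with the highest reward.
    return max(
        candidates,
        key=lambda r: evaluate(Sample(prompt, r, **resources))[0],
    )
```

Returning the evidence list alongside the scalar reward is one simple way to keep the judgment inspectable. For example, `best_of_n("2+2?", ["5", "4", ""], reference="4")` scores each candidate against the reference and the non-empty rule and keeps the highest-scoring one.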