Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs designed specifically for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight reward models tailored to general tool-use scenarios. To build these models, we propose a novel pipeline that constructs high-quality pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging preference dataset that supports both generative and discriminative reward modeling. We also introduce TRBench$_{BFCL}$, a benchmark built on the agent evaluation suite BFCL for evaluating RMs on tool-calling tasks. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 17.94% higher accuracy, substantially outperforming frontier LLMs and RMs in pairwise reward judgments. Beyond its training objective, generative ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference-time scaling while reducing output token usage by over 66%. Its support for downstream RL training further validates its practical utility. We release our data to facilitate future research.
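To make the data-construction recipe concrete, below is a minimal sketch of how rule-based scoring might turn sampled candidate function calls into (chosen, rejected) preference pairs. The scorer, the margin threshold, and the names `rule_score` and `build_pairs` are illustrative assumptions for this sketch, not the paper's actual pipeline.

```python
import json
from itertools import combinations

def rule_score(candidate: dict, reference: dict) -> float:
    """Hypothetical rule-based scorer: rewards a matching function
    name, plus partial credit for matching argument values."""
    score = 0.0
    if candidate.get("name") == reference.get("name"):
        score += 0.5
        ref_args = reference.get("arguments", {})
        cand_args = candidate.get("arguments", {})
        if ref_args:
            matched = sum(1 for k, v in ref_args.items() if cand_args.get(k) == v)
            score += 0.5 * matched / len(ref_args)
        else:
            score += 0.5
    return score

def build_pairs(candidates: list[dict], reference: dict, margin: float = 0.25) -> list[dict]:
    """Form (chosen, rejected) pairs from candidates whose rule
    scores differ by at least `margin` (an assumed threshold)."""
    scored = [(c, rule_score(c, reference)) for c in candidates]
    pairs = []
    for (a, sa), (b, sb) in combinations(scored, 2):
        if sa - sb >= margin:
            pairs.append({"chosen": a, "rejected": b})
        elif sb - sa >= margin:
            pairs.append({"chosen": b, "rejected": a})
    return pairs

# Toy example: one exact match, one partial match, one wrong tool.
reference = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
candidates = [
    {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}},
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "search_web", "arguments": {"query": "Paris weather"}},
]
print(json.dumps(build_pairs(candidates, reference), indent=2))
```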
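The Best-of-N use case mentioned above can likewise be sketched in a few lines; `generate` and `reward` are hypothetical stand-ins for the policy model and ToolRM's scoring, not an actual API.

```python
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate responses and return the one the
    reward model scores highest for the given prompt."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))
```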