Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique tasks that supports reinforcement learning with verifiable feedback. To evaluate tool-use RMs, we also introduce TRBench$_{BFCL}$, a benchmark built on the agentic evaluation suite BFCL. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 14.28% higher accuracy, substantially outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward judgments. Beyond training objectives, ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference-time scaling and reducing output token usage by over 66%. We release data and model checkpoints to facilitate future research.