Reward-guided search methods have demonstrated strong potential for enhancing tool-using agents by effectively guiding sampling and exploration over complex action spaces. As a core design choice, these search methods rely on process reward models (PRMs) to provide step-level rewards, enabling fine-grained supervision. However, systematic and reliable evaluation benchmarks for PRMs in tool-using settings are lacking. In this paper, we introduce ToolPRMBench, a large-scale benchmark specifically designed to evaluate PRMs for tool-using agents. ToolPRMBench is built on top of several representative tool-using benchmarks and converts agent trajectories into step-level test cases. Each case contains the interaction history, a correct action, a plausible but incorrect alternative, and relevant tool metadata. We use offline sampling to isolate local single-step errors and online sampling to capture realistic multi-step failures from full agent rollouts. To reduce label noise and ensure data quality, we propose a multi-LLM verification pipeline. We conduct extensive experiments on ToolPRMBench across large language models, general PRMs, and tool-specialized PRMs. The results reveal clear differences in PRM effectiveness and highlight the potential of PRMs specialized for tool use. Code and data will be released at https://github.com/David-Li0406/ToolPRMBench.