Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present an empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors (logical flaws that yield incorrect results without triggering interpreter exceptions), and they erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) serves as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes correctable grounding errors from irrecoverable mistakes. We design a scalable pipeline that constructs over 8K high-quality training instances for DataPRM via diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experimental results demonstrate that DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep using Best-of-N inference. Notably, with only 4B parameters, DataPRM outperforms strong baselines and exhibits robust generalizability across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, achieving 78.73% on DABench and 64.84% on TableBench, which validates the effectiveness of process reward supervision. Code is available at https://github.com/zjunlp/DataMind.
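To make the interplay between a ternary step reward and Best-of-N selection concrete, the following is a minimal sketch and not the DataPRM implementation. The numeric reward values, the mean aggregation, and the names `trajectory_score` and `best_of_n` are all assumptions introduced for illustration; the abstract does not specify how step rewards are encoded or combined.

```python
from statistics import mean

# Hypothetical numeric encoding of the three step labels (an assumption).
CORRECT = 1.0         # the step is sound
CORRECTABLE = 0.0     # exploratory step or grounding error the agent can still fix
IRRECOVERABLE = -1.0  # mistake that invalidates the rest of the trajectory


def trajectory_score(step_rewards):
    """Aggregate step rewards into one score; the mean is an assumed choice."""
    if not step_rewards:
        return float("-inf")  # an empty trajectory is never preferred
    return mean(step_rewards)


def best_of_n(candidates):
    """Return the answer whose trajectory has the highest aggregated score.

    `candidates` is a list of (answer, step_rewards) pairs.
    """
    return max(candidates, key=lambda c: trajectory_score(c[1]))[0]


# Three sampled trajectories with illustrative, made-up step labels.
candidates = [
    ("answer_a", [CORRECT, IRRECOVERABLE, CORRECT]),
    ("answer_b", [CORRECT, CORRECTABLE, CORRECT]),
    ("answer_c", [CORRECTABLE, CORRECTABLE, CORRECT]),
]
selected = best_of_n(candidates)
```

Under this encoding, a trajectory containing a correctable exploratory step (`answer_b`) is ranked above one containing an irrecoverable mistake (`answer_a`), which is the distinction that a binary reward would collapse.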