PRM-as-a-Judge: A Dense Evaluation Paradigm for Fine-Grained Robotic Auditing

Yuheng Ji, Yuyang Liu, Huajie Tan, Xuchuan Huang, Fanding Huang, Yijie Xu, Cheng Chi, Yuting Zhao, Huaihai Lyu, Peterson Co, Mingyu Cao, Qiongyu Zhang, Zhe Li, Enshen Zhou, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang, Xiaolong Zheng

Current robotic evaluation is still largely dominated by binary success rates, which collapse rich execution processes into a single outcome and obscure critical qualities such as progress, efficiency, and stability. To address this limitation, we propose PRM-as-a-Judge, a dense evaluation paradigm that leverages Process Reward Models (PRMs) to audit policy execution directly from trajectory videos by estimating task progress from observation sequences. Central to this paradigm is the OPD (Outcome-Process-Diagnosis) metric system, which explicitly formalizes execution quality via a task-aligned progress potential. We characterize dense robotic evaluation through two axiomatic properties: macro-consistency, which requires additive and path-consistent aggregation, and micro-resolution, which requires sensitivity to fine-grained physical evolution. Under this formulation, potential-based PRM judges provide a natural instantiation of dense evaluation, with macro-consistency following directly from the induced scalar potential. We empirically validate the micro-resolution property using RoboPulse, a diagnostic benchmark specifically designed for probing micro-scale progress discrimination, where several trajectory-trained PRM judges outperform discriminative similarity-based methods and general-purpose foundation-model judges. Finally, leveraging PRM-as-a-Judge and the OPD metric system, we conduct a structured audit of mainstream policy paradigms across long-horizon tasks, revealing behavioral signatures and failure modes that are invisible to outcome-only metrics.
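
The macro-consistency claim can be illustrated with a telescoping sum. If a judge assigns each frame a scalar progress potential, the per-step increments over any segment sum to the difference between its endpoint potentials, so aggregation is additive across sub-segments and does not depend on how the trajectory is split. The sketch below is a minimal illustration under that assumption only. The function names and the potential values are hypothetical, and it does not reproduce the paper's OPD metric definitions, which the abstract does not specify.

```python
from typing import List, Sequence


def step_increments(potential: Sequence[float]) -> List[float]:
    """Per-step progress deltas phi[t+1] - phi[t]."""
    return [b - a for a, b in zip(potential, potential[1:])]


def segment_progress(potential: Sequence[float], start: int, end: int) -> float:
    """Progress over frames start..end (inclusive), as the sum of step deltas."""
    return sum(step_increments(potential[start:end + 1]))


# Hypothetical per-frame potentials in [0, 1]; a real judge would predict
# these from the trajectory video. Frame 3 shows a regression (e.g. a slip).
phi = [0.0, 0.25, 0.5, 0.375, 0.75, 1.0]

total = segment_progress(phi, 0, len(phi) - 1)
first_half = segment_progress(phi, 0, 3)
second_half = segment_progress(phi, 3, len(phi) - 1)

# Telescoping: the summed deltas equal the endpoint difference.
print(total == phi[-1] - phi[0])
# Additivity: splitting the trajectory at frame 3 gives the same total.
print(first_half + second_half == total)
```

The negative increment at the regression step is the kind of signal a binary success rate discards: the trajectory still ends at full progress, but the dense view records that progress was lost and recovered along the way.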
