While model-based verifiers are essential for scaling Reinforcement Learning with Verifiable Rewards (RLVR), current outcome-centric verification paradigms focus primarily on the consistency between the final result and the ground truth, often neglecting errors in the derivation process. As a result, correct answers produced by flawed derivations receive positive rewards. To bridge this gap, we introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification in Mathematics and Engineering. Curated from a comprehensive collection of college-level STEM problems via a consistency-based filtering pipeline, PRIME comprises 2,530 high-difficulty samples. Through extensive evaluation, we find that current verifiers frequently fail to detect derivation flaws. Furthermore, we propose a process-aware RLVR training paradigm that uses verifiers selected via PRIME. This approach substantially outperforms the outcome-only verification baseline, achieving absolute performance gains of 8.29%, 9.12%, and 7.31% on AIME24, AIME25, and Beyond-AIME, respectively, for the Qwen3-14B-Base model. Finally, we demonstrate a strong linear correlation ($R^2 > 0.92$) between verifier accuracy on PRIME and RLVR training effectiveness, validating PRIME as a reliable predictor for verifier selection.
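To make the contrast concrete, the following is a minimal sketch (not the paper's implementation) of the process-aware reward rule the abstract describes: a rollout earns a positive reward only when the final answer matches the ground truth *and* a model-based verifier judges the derivation sound. All function and parameter names here are hypothetical.

```python
def outcome_reward(answer: str, ground_truth: str) -> float:
    """Outcome-only RLVR baseline: reward final-answer consistency alone."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0


def process_aware_reward(
    solution: str,      # full derivation / chain-of-thought text
    answer: str,        # extracted final answer
    ground_truth: str,
    verifier,           # hypothetical callable: solution text -> bool
) -> float:
    """Process-aware RLVR: a correct answer reached via a flawed
    derivation receives no reward, closing the gap noted above."""
    if outcome_reward(answer, ground_truth) == 0.0:
        return 0.0
    return 1.0 if verifier(solution) else 0.0
```

Under this rule, the verifier's ability to detect derivation flaws directly shapes the training signal, which is why verifier accuracy on PRIME would be expected to predict downstream RLVR effectiveness.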