This paper presents the Entropy-Driven Unified Process Reward Model (EDU-PRM), a framework that approaches state-of-the-art performance in process supervision while substantially reducing training cost. EDU-PRM introduces an entropy-guided dynamic step partitioning mechanism: it uses the entropy of the logit distribution to dynamically locate high-uncertainty regions during token generation. This self-assessment capability yields step-level feedback without manual fine-grained annotation, addressing a critical challenge in process supervision. In experiments on Qwen2.5-72B, training with only 7,500 EDU-PRM-generated queries reaches accuracy close to that of the full Qwen2.5-72B-PRM (71.1% vs. 71.6%) while reducing query cost by 98% compared to prior methods. These results establish EDU-PRM as an efficient approach to scalable process reward model training.
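To make the entropy-guided partitioning idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the per-token entropy is the Shannon entropy of the softmax over the logits, and that a new step begins at any token whose entropy exceeds a fixed threshold. The function names `token_entropy` and `partition_steps`, the threshold value, and the toy logits are all hypothetical; the abstract does not specify how the cut points are chosen.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def partition_steps(tokens, logits_per_token, threshold):
    """Split tokens into steps, opening a new step at each high-entropy token.

    Assumption (not from the paper): the cut is placed immediately before
    a token whose predictive entropy exceeds `threshold`.
    """
    steps, current = [], []
    for tok, logits in zip(tokens, logits_per_token):
        if current and token_entropy(logits) > threshold:
            steps.append(current)
            current = []
        current.append(tok)
    if current:
        steps.append(current)
    return steps


# Toy data: a 4-word vocabulary, with one uncertain position at "so".
tokens = ["x", "=", "2", "so", "y", "=", "4"]
peaked = [10.0, 0.0, 0.0, 0.0]  # confident prediction, entropy near 0
flat = [1.0, 1.0, 1.0, 1.0]     # uniform prediction, entropy = ln(4)
logits_per_token = [peaked, peaked, peaked, flat, peaked, peaked, peaked]

steps = partition_steps(tokens, logits_per_token, threshold=1.0)
print(steps)
```

In this toy run the only position with entropy above the threshold is the token `so`, so the sequence is split into two steps at that point. A real system would read the logits from the generating model and would likely tune the threshold, or use a top-k rule, rather than a hand-picked constant.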