In this paper, we propose a simple and efficient method for value model training on long-context reasoning traces. Compared to existing process reward models (PRMs), our method does not require a fine-grained notion of "step," which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B-parameter token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. We find that block-wise value-guided search (VGS) with a final weighted majority vote achieves better test-time scaling than standard methods such as majority voting or best-of-n. Moreover, VGS significantly reduces the inference FLOPs required to match the performance of majority voting. Our dataset, model, and codebase are open-sourced.
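To make the search procedure concrete, here is a minimal sketch of block-wise value-guided search with a final weighted majority vote. All components below (the block sampler, value model, and answer extractor) are hypothetical toy stand-ins for illustration only, not the paper's actual generator or trained value model.

```python
# Hedged sketch of block-wise value-guided search (VGS) with a final
# weighted majority vote. The generator, value model, and answer
# extractor are toy stand-ins, not the paper's actual components.
import random
from collections import defaultdict

random.seed(0)

def sample_block(trace):
    # Stand-in for sampling one block of reasoning tokens from an LLM.
    return trace + [random.choice("AB")]

def value_model(trace):
    # Stand-in for the token-level value model: scores a partial trace.
    # Here, traces with a higher fraction of "A" blocks score higher.
    return trace.count("A") / max(len(trace), 1)

def extract_answer(trace):
    # Stand-in: the "answer" is the majority symbol in the trace.
    return max(set(trace), key=trace.count)

def vgs(beam_width=4, expansions=2, num_blocks=5):
    beams = [[] for _ in range(beam_width)]
    for _ in range(num_blocks):
        # Expand each beam into several candidate continuations ...
        candidates = [sample_block(b) for b in beams for _ in range(expansions)]
        # ... and keep only the top-scoring ones under the value model.
        candidates.sort(key=value_model, reverse=True)
        beams = candidates[:beam_width]
    # Final weighted majority vote: each finished trace votes for its
    # extracted answer, weighted by its value score.
    votes = defaultdict(float)
    for b in beams:
        votes[extract_answer(b)] += value_model(b)
    return max(votes, key=votes.get)

print(vgs())
```

The key contrast with best-of-n is that the value model prunes weak partial traces at every block boundary rather than only ranking complete traces at the end; the weighted vote then aggregates the surviving traces instead of trusting a single top-scored one.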